Preface


This book is one of the first undergraduate-level textbooks
to discuss computer architecture and assembly language programming for a
modern 64-bit Reduced Instruction Set Computer (RISC). We chose the Alpha™ RISC 
processor made by Digital Equipment Corporation because it represents a thoroughly 
new design and was the first 64-bit RISC chip to be commercialized successfully. 
Moreover, the Alpha has been the principal RISC platform for Microsoft® Windows® 
NT®, and it is likely to be one of the first platforms to run a 64-bit implementation of 
Windows NT. 
In writing this book, we have brought together portions of our collective teaching 
experience at our respective institutions in order to continue the tradition of two suc- 
cessful predecessor books: 


Eckhouse, Richard H. and Robert Morris, Minicomputer Systems: Organiza- 
tion, Programming, and Applications (PDP-11). Englewood Cliffs, N.J.: 
Prentice-Hall, Inc., 1979. 


Levy, Henry M. and Richard H. Eckhouse, Computer Programming and 
Architecture: The VAX, 2nd ed. Newton, Mass.: Butterworth-Heinemann 
(Digital Press), 1989. 


That is, we discuss some of the general principles of computer architecture by guid- 
ing our readers through a pedagogically tested experience in register-level analysis 
and programming using one specific contemporary architecture. First, this was
through the study of a popular 16-bit design (PDP-11™), then using a well-respected
32-bit design (VAX™), and now by the choice of a pioneering 64-bit design (Alpha)
as the illustration.

We envision a diverse readership for this book, which may encompass the follow- 
ing: first, undergraduate or graduate classes in computer architecture and/or assembly 
language using it as the primary text; second, college and university classes on 
advanced topics in computer science using it as a supplement; third, computer profes- 
sionals using it for individual study and reference, especially those who want to build 
upon some previous familiarity with VAX systems. Although ready access to a system 
with Alpha assembler software would be optimal, we feel that persons lacking such 
access can still derive considerable benefit from this book. We have striven to keep both 
our discussions and many of the suggested exercises at a degree of transparency which 
can be worked through with pencil and paper, for we feel that a mature understanding of 
the complex or the subtle is best built on a foundation of confidence in the simple. 


Although we used the OpenVMS™ programming environment when we first 
developed drafts of this book to show proof of concept, Alpha systems have been suc- 
cessfully marketed for three different operating systems: Windows NT, Unix®, and 
OpenVMS. 


Our book is about the design and capabilities of Alpha architecture for program- 
mers, not about operating systems for their own sake. Nevertheless, some hands-on 
exposure to one programming environment or another is recommended to make sample 
programs come alive for the reader. Therefore we describe how to work within standard 
programming environments—principally Unix, contrasted with OpenVMS and Win- 
dows NT—wherever useful and pertinent to attainment of our paramount goals. 


In the course on computer architecture at Lawrence University, we introduce a 
very simple programming illustration as early as the second class meeting (out of 30) in 
a 10-week term. An additional brief demonstration program is presented as an illustra- 
tion during many subsequent class meetings, and student assignments frequently 
involve adaptations and extensions of such models. These illustrative programs are 
available as source text on a CD-ROM accompanying this book. 


Increasingly powerful techniques for input and output are introduced progres- 
sively throughout the course, beginning with simple debugging techniques and continu- 
ing up through the use of sequential files for both input and output. The vendor- 
supplied debugger is introduced as a versatile tool to be used by students routinely from 
an early point in the course. 

The chapters of this book contain more material than some instructors may wish 
to present in their courses, particularly if comparisons among multiple architectures or 
concepts of hardware organization are incorporated within the same course. Portions of 
Chapter 6 on byte manipulations and all of Chapter 8 on floating-point operations can
be omitted without loss of continuity. 

We know that some instructors would prefer to put a thorough treatment of proce- 
dure calls (Chapter 7) at an earlier point than we do, but this may be chiefly in order to 
accomplish input/output. In our approach, we prefer to use the debugger at first, and 
then to introduce a very simple use of procedures based on a high-level language for 
input/output. The details of procedure-calling mechanisms are only taken up in Chapter 
7 after much of the Alpha instruction set and several foundational topics (e.g., address- 
ing modes and stacks) have been discussed. 

Chapter 9 presents traditional material on the macro-processing capabilities of the 
OpenVMS assembler, and can be omitted without loss of continuity by readers who are 
following an emphasis on Unix. 

Chapter 10 continues to develop the principles of register-level programming 
through examples that draw upon some of the system-supplied support functions that 
the C language uses for input and output from text files. 

The final three chapters in this book take up advanced topics related to Alpha 
architecture, including an overview of extensions added in later chip implementations. 
We use the techniques of an experimentalist more than the perspective of a theorist 
when dealing with performance-related concerns. 

We have found that an exploration of the machine code produced by several high- 
level language compilers proves to be remarkably illuminating to students in the latter 
part of our course. For example, they are often quite taken aback to learn how much 
overhead is entailed by a high degree of proceduralization of a program. Such “eureka” 
experiences serve as rewards for their effort to comprehend assembly language and 
computer architecture. Chapter 12, in which we show how to investigate compiler out- 
put, represents a distinctive feature of this book. 

Our final chapter explores why a computer maker may decide to extend a well- 
defined architecture and not solely refine its performance through better implementa- 
tions. Chapter 13 also suggests how PALcode bridges between domains of pure soft- 
ware and pure hardware for Alpha systems. 
CHAPTER 1


Architecture and 
Implementation 


Computer science frequently distinguishes between
abstraction and implementation, i.e., between the general and the particular. We may 
examine any computer system at two major levels: its architecture and its organization. 
Although numerous books convey both of these levels in their titles and contents, we 
are going to concentrate on architecture in this book. Therefore we first direct our read- 
ers toward an understanding of the distinction between these levels. 


In the first decades of the history of computers, the sporadic emergence of new 
ideas and new companies kept resulting in a rather jumbled succession of disparate 
approaches to computer design. The design of the IBM 360 series by Amdahl and oth- 
ers, however, marked not only the trendsetting idea of a family line of computers but 
also a clear articulation of architecture apart from implementation: 


• The architecture of a computer system is the abstraction equivalent to the
user-visible interface: the structure and the operation of the system as seen by the
assembly language programmer and the compiler writer. If an architecture is
well-designed, well-engineered to adapt to new technologies, and well-liked, it may
persist for a decade or longer.


• An implementation is the realization and construction of that interface and
structure out of specific hardware (and possibly software) components. Because of
technological advances, any particular implementation (i.e., one model) may be
actively marketed only for a relatively short time.
Thus several different implementations of an architecture may appear over a period of
years. Each will offer different trade-offs between cost and performance or convenience, 
but each will present exactly the same interface to the assembly language programmer. 
Such consistency over time, despite technological change, has clearly helped computer 
system manufacturers to retain brand loyalty. 


Analogy: Piano Architecture 


Architecture applies to buildings, landscapes, computers, and even pianos. Let us 
briefly consider the architecture of pianos. The definition of piano architecture is the 
specification of the keyboard, as shown in Figure 1.1a. The keyboard is the user
(player) interface to this musical instrument. It consists of 88 keys: 36 black keys and 
52 white keys. Striking a key causes a note of specified frequency to be sounded. The 
size and the arrangement of the keys are identical for all modern piano keyboards. 
Therefore, anyone who can play the piano can play any piano. 





b. Piano implementations 


Figure 1.1 Architecture and implementation of the piano 







Many implementations of the piano architecture are possible, as shown in Figure 
1.1b. The implementation is concerned with the materials used to build the instrument. 
The kinds of wood and metal used, the selection of ivory or plastic keys, the shape of 
the instrument, and so forth, are all implementation decisions made by the piano 
builder. Regardless of the implementation decisions made, however, the final product 
can be played by any piano player. 

In a computer system, the architecture consists of the programming interface: the 
instruction set, the structure and addressing of memory, the control of input/output (I/O)
devices, and so on. Several implementations of an architecture will be possible using 
different electronic design techniques that may have different size, cost, and perfor- 
mance characteristics. Yet a program that runs on one machine would run on all 
machines conforming to the same architecture. 


Types of Computer Languages 


The first computers had to be programmed by people who knew the detailed capabili- 
ties and limitations of the hardware. Memory cells were an especially precious 
resource, and great care and ingenuity were required to squeeze algorithms into the 
available space, almost bit by bit. Subsequently, as the overall capabilities of computers 
have improved and the potential for more widespread use has become evident,
successive generations of computer languages have been devised in order to improve
programmer productivity and accuracy, as shown in Table 1.1, where nGL means
nth-generation language.


Table 1.1 Generations of Computer Languages

Generation  Description    Attributes and Examples

1GL         Machine        Each instruction speaks directly to the hardware level of
            language       the particular architecture. Instructions are numeric
                           (i.e., patterns of 0s and 1s), but those can be made
                           partially comprehensible by clustering adjacent bits
                           together using an appropriate choice of base:
                           • decimal (base 10), for the IBM 1620
                           • octal (base 8), for the PDP-11
                           • hexadecimal (base 16), for the VAX, the Alpha, and
                             most other current architectures.

2GL         Assembly       Each instruction is mnemonic, e.g., ADD, but stands in
            language       one-to-one correspondence with machine instructions.
                           Additional directives to the assembler program help with
                           storage allocation and program segmentation.

3GL         High-level     Statements in an arbitrarily defined artificial
            languages      programming language are translated by a compiler program
                           into appropriate sequences of machine instructions.
                           Examples include COBOL, FORTRAN, PL/I, BASIC, C, Pascal,
                           Ada.

4GL         Object-        Newer types of computer languages include:
            oriented       • artificial intelligence languages (e.g., LISP)
            languages      • database access languages (e.g., SQL)
                           • natural-language query languages
                           • object-oriented languages (e.g., C++, Java, or
                             Smalltalk).




Assembly language is situated between machine language and higher-level lan- 
guages such as C or Pascal. Assembly language more precisely expresses the con- 
straints of an architecture than high-level programming languages, but the latter are 
more amenable to re-use of code segments and global optimizations. Accordingly, 
learning at least one assembly language can lead to an appreciation of what program- 
ming languages may actually do “underneath.” 


Why Study Assembly Language? 


The convenience and greater portability of high-level languages raise the very real 
question of why anyone should study assembly language. Does the answer entail some 
arcane rite of initiation for the truly computer-savvy? Is the primary motivation “to see 
how a computer really works”? 


In a purely intellectual sense, a valid answer would include the personal attain- 
ment of a greater appreciation of at least one computer architecture in some depth, in 
order to comprehend its strengths and weaknesses without any filtering through other 
software levels. Moreover, in a pragmatic sense, assembly language still today may 
make possible the accomplishment of desirable outcomes that are not easily achievable 
otherwise, including: 


• the fastest attainable execution speed;
• the least memory usage;
• very specialized data manipulation, thereby compensating for features lacking in a
programming language; and
• specialized device control, such as adding a device driver not regularly furnished
with the operating system.


But, you may object, hardware improvements over time certainly bring ever-faster exe- 
cution speeds. And the ever-greater storage densities of memory technology would 
seem to obviate most concerns about program size. Yet occasions do still arise when the 
tightest, fastest code possible has significant pragmatic incentive. Good software engi- 
neering teams know when to write a little assembly language, even as they use 3GL and 
4GL tools predominantly.

To be sure, assembly language is relatively hard to write, debug, and maintain. 
Most severely, assembly language lacks transportability from one architecture to 
another and thus suffers significantly in comparison to high-level languages. Hyde 
(1996) has discussed these and other objections to assembly language, but has adduced 
further positive reasons why learning assembly language deserves the attention of aspir- 
ing and practicing computer scientists. In particular, he has stated that good assembly 
language programmers make better high-level language programmers “because they 
understand the limitations of the compiler and they know what it’s doing with their 
code.” We will return to this latter concept in Chapter 12. 

In the Digital Unix programming environment, the system program that understands 
the Alpha instruction set in symbolic assembly language is just called “the assembler.” In 
the OpenVMS programming environment, the corresponding system program is called 
the MACRO-64 assembler. The word “macro” was chosen because the assembler has 
powerful text-substitution capabilities that are introduced in Chapter 9. 

In this book, we present a generic form of Alpha assembly language which needs 
as few adjustments as possible to be acceptable to the assembler programs for the Unix, 
OpenVMS, and Windows NT programming environments. 


The Alpha and its Ancestors in a Superfamily 


The design of the IBM 360 strongly influenced many subsequent computer architec- 
tures, including 16-bit minicomputers brought forth by numerous manufacturers during 
the 1970s. Quite possibly the most successful among those new designs, the PDP-11 
made by Digital Equipment Corporation, not only persisted through about a dozen 
implementations over more than two decades, but also has come to be seen as the pro- 
genitor of a thriving superfamily of computers by the same company. 

As memory technologies have improved, the push toward convenient addressabil- 
ity of increasingly large amounts of memory has driven Digital and other manufacturers 
to redesign their computers with successively greater widths for the internal registers 
and pathways where addresses as well as data are manipulated. By 1990 this trend led
to 32 bits as the prevailing standard register width for even the smallest computers, and 
then to 64 bits as an emerging standard for mid-range and larger computers. 


Some of the attributes of the PDP-11, VAX, and Alpha product lines are summa- 
rized in Table 1.2. The VAX is frequently cited, along with the Intel 80x86/Pentium 
series, as the exemplar of a Complex Instruction Set Computer (CISC). The Alpha was 
the first major 64-bit example of a Reduced Instruction Set Computer (RISC) to attain 
wide commercial deployment. Several other manufacturers had already brought out 
successful 32-bit RISC designs, but Digital Equipment Corporation opted to make the 
move to 64 bits simultaneously with the move from CISC to RISC. We discuss some of
the other information in Table 1.2 in later sections of this book.


Table 1.2 Comparison of Some Computer Architectures by Digital Equipment
Corporation

                                PDP-11              VAX                 Alpha

Complexity classification       classic mini        CISC                RISC
Number of integer registers     8                   16                  32
Integer register width          16 bits (2 bytes)   32 bits (4 bytes)   64 bits (8 bytes)
Instruction size                2, 4, 6 bytes       1 - 37 bytes        4 bytes
Number of instruction styles    6                   byte stream         7
Number of opcodes               > 100               > 256               > 100
Number of operands              0, 1, 2             0 - 6               0 - 3
Allowed memory access           many instructions   many instructions   only load/store
Number of addressing modes      8                   12                  2
Number of integer data types    2                   5                   2
Number of floating data types   2                   4                   5
Byte ordering                   little-endian       little-endian       little-endian
                                                                        (optional big-endian)
Unidirectional branch range     255 bytes           127 bytes           4 MB
Logical address space           64 KB               4 GB                16 EB*
Input/output strategy           memory-mapped       memory-mapped       memory-mapped
Principal operating systems     Unix (several),     Unix (several),     Unix (64-bit),
                                RSX (several),      OpenVMS             OpenVMS (64-bit),
                                RSTS/E, RT-11                           Windows NT (32-bit)
Lifetime as a marketed product  1971 - 1995         1978 -              1992 -

* E (exa-) is, strictly, the prefix for $10^{18}$ in the SI system of units, but here it is used for an approximation of $2^{60}$.
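
As a quick check on the logical address space row of Table 1.2, the following worked arithmetic is added for illustration. It assumes byte addressing and an address width equal to the integer register width, which matches the three values listed in the table.

```latex
\begin{aligned}
\text{PDP-11:}\quad 2^{16}\ \text{bytes} &= 64 \times 2^{10}\ \text{bytes} = 64\ \text{KB} \\
\text{VAX:}\quad    2^{32}\ \text{bytes} &= 4 \times 2^{30}\ \text{bytes} = 4\ \text{GB} \\
\text{Alpha:}\quad  2^{64}\ \text{bytes} &= 16 \times 2^{60}\ \text{bytes} = 16\ \text{EB},
\qquad 2^{60} \approx 1.15 \times 10^{18}
\end{aligned}
```

The last line shows why the footnote calls the usage approximate: $2^{60}$ exceeds $10^{18}$ by about 15 percent.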
SQUARES: A Comparative Programming Example


Beginnings are hard. Just think back, if you can, to the very first program that you wrote 
in any computer language. It was probably oversimplified, and the hardest task may 
have been getting data in or out. In this book, each illustrative example in assembly lan- 
guage is kept as simple as possible in order to help you focus your reading and study on 
the heart of the matter at hand, though you must attend to all the small details also. 


We present in this section a very simple but complete program. Results from its 
execution will be verified using a feature of the Digital Command Language (DCL) in 
the OpenVMS programming environment. You should be able to follow this illustration 
readily even if you do not have access to an OpenVMS system with the MACRO-64 
assembler installed. Later, we will come back to this simple program and modify it for 
different means of output for the Unix programming environment. 

Statement of the problem: Write an Alpha assembly language program that will 
produce in memory a table of the squares of the first 3 integers without using any multi- 
plication instruction. 

Presentation of the algorithm: We begin by writing down the first several integers, 
N, then their squares, $N^2$, and finally the first and second tabular differences:


               1st tabular   2nd tabular
 N    $N^2$    difference    difference
 1      1
                    3
 2      4                         2
                    5
 3      9                         2
                    7
 4     16                         2
                    9
 5     25



Successive values of the first tabular difference can be computed by adding the constant 
second tabular difference each time. Successive values of the squares can be computed 
by adding the appropriate value of the first difference to the already known previous 
square. This enumeration can be readily expressed in a standard 3GL programming lan- 
guage, for example, in Pascal: 
program SQUARES (output);
var sq1, sq2, sq3 : [global] integer;
var temp, diff1, diff2 : integer;

begin
    diff1 := 1;
    diff2 := 2;
    temp := 1;
    sq1 := temp;
    diff1 := diff2 + diff1;
    temp := diff1 + temp;
    sq2 := temp;
    diff1 := diff2 + diff1;
    temp := diff1 + temp;
    sq3 := temp;
    writeln (sq1, sq2, sq3)
end.

or in ANSI C: 


#include <stdio.h>
int main (void)
{
    long int sq1, sq2, sq3;
    long int temp, diff1, diff2;

    diff1 = 1L;
    diff2 = 2L;
    temp = 1L;
    sq1 = temp;
    diff1 = diff2 + diff1;
    temp = diff1 + temp;
    sq2 = temp;
    diff1 = diff2 + diff1;
    temp = diff1 + temp;
    sq3 = temp;
    /* %ld matches the long int arguments */
    printf ("%ld\t%ld\t%ld\n", sq1, sq2, sq3);
    return 0;
}


It should be evident that the pattern of first adjusting diff1, then using diff1 to 
adjust temp, and finally storing temp as the next square could be iterated any desired 
number of times to compute sq4, etc.
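
The iteration described in the preceding paragraph can be sketched as a loop. The following C function is an illustration only and is not part of the SQUARES program as published; the name squares_by_differences and its parameters are assumptions introduced for this sketch.

```c
#include <stddef.h>

/* Fill table[0..count-1] with 1, 4, 9, ... using additions only.
   squares_by_differences is a hypothetical name chosen for this sketch. */
void squares_by_differences(long table[], size_t count)
{
    long temp  = 1L;   /* current square, starting with 1 squared */
    long diff1 = 1L;   /* first tabular difference                */
    long diff2 = 2L;   /* constant second tabular difference      */
    size_t i;

    for (i = 0; i < count; i++) {
        table[i] = temp;          /* store the current square   */
        diff1 = diff2 + diff1;    /* adjust first difference    */
        temp  = diff1 + temp;     /* advance to the next square */
    }
}
```

Calling it with an array of three elements reproduces the three stored values of the straight-line version; a larger count simply continues the same pattern.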

Now let us transform the expression of this algorithm from a 3GL implementation 
into a 2GL equivalent in Alpha assembly language. The 32 integer registers in the 
Alpha processor can each hold one 64-bit integer (Table 1.2) and are named R0, R1, R2,
..., R31 in the OpenVMS programming environment. Later we will learn that a few of 
these have specially designated roles, but for now we assign the first three to correspond
to temp, diff1, and diff2 in the Pascal expression of the algorithm. We intention- 
ally gloss over, for now, the use of R15 and R27 for setting up an addressing context for 
memory locations that correspond to sq1, sq2, and sq3 in the Pascal version. 

The algorithm in Alpha assembly language appears in the top third of Figure 1.2. 
All the elements of Alpha assembly language used here will be explained more fully in 
later chapters. For the present, it is enough to appreciate that the three central columns 
make up the actual program instructions. The phrases in mixed case to the right of a 
semicolon are explanatory comments that annotate the programmer’s intended relation- 
ship between those instructions and the algorithm. The leftmost column shows the suc- 
cession of 32-bit Alpha instructions produced by the assembler expressed in 
hexadecimal (base 16) notation. (A review of number systems and notation is presented 
in the next section of this chapter.) 


                     1          .TITLE   SQUARES Table of Squares (Alpha)
                     2          $ROUTINE SQUARES, DATA_SECTION_POINTER=TRUE, -
                     3                   KIND=STACK, SAVED_REGS=<R2>
          0018    2130          $DATA_SECTION
00000008  0000    2131  SQ1::   .BLKQ    1           ; To store 1 squared
00000010  0008    2132  SQ2::   .BLKQ    1           ; To store 2 squared
00000018  0010    2133  SQ3::   .BLKQ    1           ; To store 3 squared
          0018    2134                               ; etc.
          0018    2135          $CODE_SECTION
00000008' 0018    2136          .BASE    R27,$LS     ; R27 -> linkage section
A5FBFFF8  0018    2137          LDQ      R15,$DP     ; R15 -> data section
00000000' 001C    2138          .BASE    R15,$DS     ; Tell MACRO about this
47E03401  001C    2139          MOV      #1,R1       ; R1 = first difference
47E05402  0020    2140          MOV      #2,R2       ; R2 = second difference
47E03400  0024    2141          MOV      #1,R0       ; R0 = first square
B40F0000  0028    2142          STQ      R0,SQ1      ;   to be stored
40410401  002C    2143          ADDQ     R2,R1,R1    ; Adjust first difference
40200400  0030    2144          ADDQ     R1,R0,R0    ; R0 = second square
B40F0008  0034    2145          STQ      R0,SQ2      ;   to be stored
40410401  0038    2146          ADDQ     R2,R1,R1    ; Adjust first difference
40200400  003C    2147          ADDQ     R1,R0,R0    ; R0 = third square
B40F0010  0040    2148          STQ      R0,SQ3      ;   to be stored
          0044    2149                               ; etc.

                 0000     1          .TITLE  SQUARES Table of Squares (VAX)
       00000004  0000     2  SQ1::   .BLKL   1           ; To store 1 squared
       00000008  0004     3  SQ2::   .BLKL   1           ; To store 2 squared
       0000000C  0008     4  SQ3::   .BLKL   1           ; To store 3 squared
                 000C     5                              ; etc.
           0000  000C     6          .ENTRY  SQUARES,0   ; Define entry point
       51 01 D0  000E     7          MOVL    #1,R1       ; R1 = first difference
       52 02 D0  0011     8          MOVL    #2,R2       ; R2 = second difference
       50 01 D0  0014     9          MOVL    #1,R0       ; R0 = first square
    E5 AF 50 D0  0017    10          MOVL    R0,SQ1      ;   to be stored
       51 52 C0  001B    11          ADDL    R2,R1       ; Adjust first difference
       50 51 C0  001E    12          ADDL    R1,R0       ; R0 = second square
    DF AF 50 D0  0021    13          MOVL    R0,SQ2      ;   to be stored
       51 52 C0  0025    14          ADDL    R2,R1       ; Adjust first difference
       50 51 C0  0028    15          ADDL    R1,R0       ; R0 = third square
    D9 AF 50 D0  002B    16          MOVL    R0,SQ3      ;   to be stored
                 002F    17                              ; etc.
       FD AF 17  002F    18  LOOP::  JMP     LOOP        ; Hopeless loop!
                 0032    19          $EXIT_S             ; Here's a normal exit
                 003B    20          .END    SQUARES     ; Set start address

 1                                  .TITLE  SQUARES Table of Squares (PDP-11)
 2  000000                 SQ1::    .BLKW   1           ; To store 1 squared
 3  000002                 SQ2::    .BLKW   1           ; To store 2 squared
 4  000004                 SQ3::    .BLKW   1           ; To store 3 squared
 5                                                      ; etc.
 6                                  .MCALL  EXIT$S      ; Include a system routine
 7  000006                 SQUARES::
 8  000006 012701 000001            MOV     #1,R1       ; R1 = first difference
 9  000012 012702 000002            MOV     #2,R2       ; R2 = second difference
10  000016 012700 000001            MOV     #1,R0       ; R0 = first square
11  000022 010067 177752            MOV     R0,SQ1      ;   to be stored
12  000026 060201                   ADD     R2,R1       ; Adjust first difference
13  000030 060100                   ADD     R1,R0       ; R0 = second square
14  000032 010067 177744            MOV     R0,SQ2      ;   to be stored
15  000036 060201                   ADD     R2,R1       ; Adjust first difference
16  000040 060100                   ADD     R1,R0       ; R0 = third square
17  000042 010067 177736            MOV     R0,SQ3      ;   to be stored
18                                                      ; etc.
19  000046 000167 177774   LOOP::   JMP     LOOP        ; Hopeless loop!
20  000052                          EXIT$S              ; Here's a normal exit
21         000006'                  .END    SQUARES     ; Set start address





Figure 1.2 Versions of the SQUARES program for three Digital architectures 


Caution: This program contains an intentional infinite loop. When we run it, we
should issue a Control-Y almost immediately to interrupt it. OpenVMS will keep the
program and, most importantly, the computed data values in the current memory
partition where the examine command can access them. (Running any other program,
such as a directory command, would wipe out the computed data.)


Let us assume that we have the text file of the program, squares.m64, in our
current default directory. The following DCL commands will assemble that program, 
link the intermediate object file into an executable format, run the program, and
examine the results:


$ macro/alpha/list squares 
$ link/map squares

$ run squares 

<Control-Y> 

$ examine/hexadecimal 30000 
00030000: 00000001 

$ examine/hexadecimal 30008 
00030008: 00000004 

$ examine/hexadecimal 30010 
00030010: 00000009 


But how did we know which memory locations to examine? We asked the linker to pro- 
duce a map file, which we could type out to see what numeric addresses were assigned 
to our computed results (sq1, sq2, sq3). We also asked the assembler to produce a
listing file, a portion of which has been excerpted, with some adjustments in spaces and 
tabs, to produce the top third of Figure 1.2. You will learn more about map and listing 
files in Chapter 3. 

Figure 1.2 illustrates the family resemblances in assembly language for the Digital 
Alpha (top third) and its VAX and PDP-11 predecessors. The middle third of Figure 1.2 
displays a corresponding program for the SQUARES algorithm for the VAX, written in 
VAX assembly language. Similarly, the bottom third of Figure 1.2 displays a version of 
SQUARES for the PDP-11, written in PDP-11 assembly language. We have chosen the 
most “natural” integer sizes for these three architectures: 64 bits (which, for historical 
reasons, Digital calls a quadword) for the Alpha, 32 bits (which, for historical reasons, 
Digital calls a longword) for the VAX, and 16 bits (which Digital calls a word) for the 
PDP-11. Correspondingly, we have used the appropriate quadword variants of Alpha 
instructions (final character Q) and longword variants of VAX instructions (final char- 
acter L). 

We can also readily perceive a principal drawback of assembly language program- 
ming by comparing these versions of SQUARES. Virtually every line has had to be 
edited in order to produce one version from another, despite the strong homologies 
present in this illustration. In a more general case, the differences would be even greater 
because it would not be possible to find corresponding instruction types all the way 
through. In addition, there are differences in how the program must be “framed” with 
beginning and ending directives that are required for the different operating system 
environments. Figure 1.2 was prepared using assemblers for OpenVMS for Alpha, 
OpenVMS for VAX, and RSX-11M/M-PLUS for PDP-11. 
Review of Number Systems


The various columns of numbers in Figure 1.2 are not all expressed using the same 
base. Some are decimal (base 10) while others are hexadecimal (base 16) for the Alpha 
and VAX versions; some are octal (base 8) for the PDP-11 version. Throughout this 
book we will have to use decimal, binary, and hexadecimal number systems. Therefore 
a concise review of number systems and representations for integer data will conclude 
this introductory chapter. You should skip this material only if you are already adept 
with conversions among these representations, including the expression of negative 
integer values. 

All data stored and manipulated in contemporary computers are represented in 
binary form. Integers, floating-point numbers, characters, and instructions are all repre- 
sented as sequences of binary zeros and ones. This binary representation is called base 
2. We defer discussion of the encoding of floating-point numbers, characters, and 
instructions. Here we first discuss only integer data. 

When we specify input or when we display output of pure binary data, we usually 
use base 8 (octal), base 10 (decimal), or base 16 (hexadecimal) instead of base 2 (pure 
binary). These larger bases are easier for humans to comprehend than long strings of 
binary digits; however, the value of a stored chunk of numeric information is the same 
regardless of the base used to represent it. Different bases are suitable for different 
applications. Bases 8 and 16 are particularly useful for emphasizing patterns of bits 
within a stored unit, while base 10 is useful for understanding the everyday value of a 
stored number, since it is most closely related to counting in natural human languages. 


Positional Coefficients and Weights 


The most frequently encountered numbering system involves positional coefficients and 
weights. The digit value at each position in a number is its positional coefficient. The 
weight of each digit is a successively larger power of the base of the number system, going 
from right to left. We can express a value Q in the number system with base (also called 
radix) r, as follows, assuming that there may possibly be a fractional part {in braces}: 
$$Q = x_n w_n + x_{n-1} w_{n-1} + \cdots + x_1 w_1 + x_0 w_0 + \{x_{-1} w_{-1} + \cdots + x_{-m} w_{-m}\}$$
where
$$w_i = r^i$$
that is, $r^i$ = weight and $r$ = radix or base, and
$$0 \le x_i \le r - 1$$
and $x_i$ = positional coefficient.
This formalism ensures that the largest value of a positional coefficient is always 
one less than the base value, i.e., 1 for base 2 (binary), 9 for base 10 (decimal), and so 
forth. When the base is greater than 10, the letters A, B, ... can be used to convey the 
positional coefficients with values of 10 and beyond. Base 16 (hexadecimal) uses the 
letters A through F to represent numerical values 10 through 15, respectively. 

We will illustrate these concepts by expressing the number 140 in the bases 2, 8, 
10, and 16, as follows: 


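\begin{align*}
140 &= 1 \times 10^2 + 4 \times 10^1 + 0 \times 10^0 && \text{(base 10)} \\
    &= 1 \times 2^7 + 0 \times 2^6 + 0 \times 2^5 + 0 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 0 \times 2^0 && \text{(base 2)} \\
    &= 2 \times 8^2 + 1 \times 8^1 + 4 \times 8^0 && \text{(base 8)} \\
    &= 8 \times 16^1 + \mathrm{C} \times 16^0 && \text{(base 16)}
\end{align*}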


That is, the value 140 results from summing the non-zero terms: 


\begin{align*}
140 &= 100 + 40 && \text{(base 10)} \\
    &= 128 + 8 + 4 && \text{(base 2)} \\
    &= 128 + 8 + 4 && \text{(base 8)} \\
    &= 128 + 12 && \text{(base 16)}
\end{align*}


Summary: $140_{10} = 10001100_2 = 214_8 = \mathrm{8C}_{16}$.
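The positional sum can also be checked mechanically. The following Python sketch evaluates a whole-number digit string by adding up coefficient times weight, from right to left. The function name `positional_value` and the `DIGITS` table are our own illustrative choices, and the fractional part shown in braces in the general formula is deliberately left out.

```python
DIGITS = "0123456789ABCDEF"  # coefficients 0..15; A-F stand for 10-15


def positional_value(digits, radix):
    """Return the value of a digit string as the sum of coefficient * radix**position."""
    total = 0
    # Position 0 is the rightmost digit, so walk the string in reverse.
    for position, digit in enumerate(reversed(digits.upper())):
        coefficient = DIGITS.index(digit)
        if coefficient >= radix:
            raise ValueError(f"digit {digit!r} is not valid in base {radix}")
        total += coefficient * radix ** position
    return total


# The four spellings of the same stored value used in the text.
for text, radix in [("140", 10), ("10001100", 2), ("214", 8), ("8C", 16)]:
    print(text, "in base", radix, "=", positional_value(text, radix))
```

Each of the four lines printed should report the same value, 140, which is the point of the summary line above.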


Binary and Hexadecimal Representations 


Modern digital computers use the binary number system internally because the most 
practical physical components are intrinsically binary in nature. Since long strings of 0s 
and 1s are cumbersome for human beings, most computer professionals routinely use 
base 8 or base 16 representations instead. The various system components—assem- 
blers, compilers, linkers, and so forth—readily convert such numbers to binary equiva- 
lents for us. Table 1.3 shows the binary, octal, and hexadecimal equivalents for the 
decimal values 0 through 16. 
Table 1.3 Conversion Table for Integers 


Decimal   Binary    Octal   Hexadecimal 
0         0000      0       0 
1         0001      1       1 
2         0010      2       2 
3         0011      3       3 
4         0100      4       4 
5         0101      5       5 
6         0110      6       6 
7         0111      7       7 
8         1000      10      8 
9         1001      11      9 
10        1010      12      A 
11        1011      13      B 
12        1100      14      C 
13        1101      15      D 
14        1110      16      E 
15        1111      17      F 
16        1 0000    20      10 


Base 8 and 16 representations are not only convenient but are also easily derived 
from binary representations. Conversion simply requires the separation of the binary 
number into 3-bit (for octal) or 4-bit (for hexadecimal) groups, from right to left, and 
the replacement of each binary group with the appropriate digit for the new base. Con- 
sider the following illustration: 


$010011100001_2$ (base 2) 
$010\ 011\ 100\ 001_2 = 2341_8$ (base 8) 
$0100\ 1110\ 0001_2 = \mathrm{4E1}_{16}$ (base 16) 
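The grouping rule is simple enough to express in a few lines. The Python sketch below pads the bit string on the left to a whole number of groups, cuts it into 3-bit or 4-bit groups, and replaces each group with one digit. The helper name `regroup` is hypothetical and chosen only for this illustration.

```python
DIGITS = "0123456789ABCDEF"


def regroup(bits, group_size):
    """Convert a binary string to octal (group_size=3) or hexadecimal (group_size=4)."""
    bits = bits.replace(" ", "")
    # Grouping proceeds from right to left, so any padding goes on the left.
    padding = (-len(bits)) % group_size
    bits = "0" * padding + bits
    groups = [bits[i:i + group_size] for i in range(0, len(bits), group_size)]
    # Each group is small enough to map onto a single digit of the new base.
    return "".join(DIGITS[int(group, 2)] for group in groups)


print(regroup("010011100001", 3))  # octal digits
print(regroup("010011100001", 4))  # hexadecimal digits
```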


Although this process is relatively intuitive, you may already appreciate that some 
pocket calculators also have the capability of converting numbers amongst these com- 
mon bases. 
The system software products for the assembly language programmer may 
express their outputs—program listings, linker maps, debugging aids, and dumps of 
memory contents—in octal or hexadecimal by default; decimal conversion is some- 
times offered also. There are times, however, when manual interpretation is still 
required because several data elements of different bit widths have been packed for 
storage as one composite binary number. 
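As an illustration of that kind of manual interpretation, the Python sketch below pulls fields of different widths out of one composite value by shifting and masking. The three-field layout is purely hypothetical and does not correspond to any particular Digital data format; `extract_field` is our own helper name.

```python
def extract_field(word, low_bit, width):
    """Return the `width`-bit field whose least significant bit is `low_bit`."""
    mask = (1 << width) - 1          # e.g. width 5 -> binary 11111
    return (word >> low_bit) & mask


# Hypothetical 12-bit composite: bits 11-8, bits 7-3, and bits 2-0 are separate fields.
composite = 0x4E1                     # binary 0100 1110 0001

high_field = extract_field(composite, 8, 4)    # bits 11..8
middle_field = extract_field(composite, 3, 5)  # bits 7..3
low_field = extract_field(composite, 0, 3)     # bits 2..0
print(high_field, middle_field, low_field)
```

Note that the field boundaries do not line up with the 4-bit hexadecimal digits, which is exactly why a dump shown in hexadecimal may need to be rewritten in binary before such fields can be read.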


Signed Integers 




In the history of computer development, three methods have been considered for repre- 
senting ranges that include negative as well as positive integers; these methods are 
called sign and magnitude, one’s complement, and two’s complement. Sign and magni- 
tude form requires greater complexity at the physical implementation level, while one’s 
complement form introduces the complication of having two bit patterns that both rep- 
resent the number zero. Since testing for zero is a common operation, the need to con- 
sider two cases would require greater complexity at the physical implementation level. 
Accordingly, contemporary computers use two’s complement representation for signed 
integers. 

All methods for representing signed integers within N bits have the net effect of 
allocating one bit to represent positive/negative and N — 1 bits to represent the number’s 
magnitude. For two’s complement, zero is considered to be a positive number. Two’s 
complement representation offers the advantage of making successive additions or sub- 
tractions of 1 work smoothly right through zero. Consider the eight signed numbers that 
can be represented by 3 bits: 


+3   011 
+2   010 
+1   001 
 0   000 
-1   111 
-2   110 
-3   101 
-4   100 


Notice that a complete binary counting sequence has been “cut” into two halves and 
rearranged without further shuffling, much like “cutting” a deck of playing cards. Also 
appreciate that the range of integers that can be represented within N bits extends from 
the value $-2^{N-1}$ through 0 to $+2^{N-1} - 1$. 
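The range and the 3-bit table can be reproduced with a short Python sketch. The names `encode` and `decode` are ours; Python integers are unbounded, so the N-bit field is imitated here by masking.

```python
def encode(value, n_bits):
    """Return the n_bits-wide two's complement bit pattern of value as a string."""
    lowest = -(1 << (n_bits - 1))
    highest = (1 << (n_bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {n_bits} bits")
    # Masking keeps only the low n_bits, which yields the two's complement pattern.
    return format(value & ((1 << n_bits) - 1), f"0{n_bits}b")


def decode(bits):
    """Return the signed value of a two's complement bit string."""
    raw = int(bits, 2)
    # A leading 1 marks a negative number: subtract 2**N to recover it.
    return raw - (1 << len(bits)) if bits[0] == "1" else raw


for value in range(3, -5, -1):
    print(f"{value:+d}", encode(value, 3))
```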
No special action is required to form the two’s complement representation for a 
positive integer, except to realize that the most significant bit must be an explicit zero. 
Forming the two’s complement of a negative integer can be accomplished by subtract- 
ing the magnitude from zero, for example: 


   0      000 
 (+3)   -(011)   (subtraction, with due consideration for “borrowings”) 
  -3      101 


Another method is to perform a bit-by-bit complementation of the positive number 
(including its zero sign bit) and then to add 1 to that intermediate result: 


  +3   011   becomes   100    (intermediate result) 
                     +(001)   (addition, with due consideration for “carries”) 
  -3                   101 


In either of these methods, any borrowing or carrying outside of the N-bit field is not to 
be written as part of the final result. 

Finding the hexadecimal representation of a negative integer proceeds similarly by 
mentally subtracting each digit from “F” and then adding 1 to that intermediate result: 


+15502   3C8E   becomes   C371    (intermediate result) 
                        +(0001)   (addition, with due consideration for “carries”) 
-15502                    C372 


Fortunately, as with all such techniques, familiarity comes with practice. 
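For practice, both negation methods can be tried out in Python. This is a minimal sketch with function names of our own choosing; the mask plays the role of the N-bit field, discarding any borrow or carry that would fall outside it.

```python
def negate_by_subtraction(pattern, n_bits):
    """Subtract the pattern from zero, keeping only the low n_bits of the result."""
    mask = (1 << n_bits) - 1
    return (0 - pattern) & mask


def negate_by_complement(pattern, n_bits):
    """Complement every bit, add 1, and discard any carry out of the field."""
    mask = (1 << n_bits) - 1
    return ((pattern ^ mask) + 1) & mask


print(format(negate_by_subtraction(0b011, 3), "03b"))   # 3-bit example
print(format(negate_by_complement(0b011, 3), "03b"))
print(format(negate_by_complement(0x3C8E, 16), "04X"))  # 16-bit hexadecimal example
```

Complementing a 4-bit group is the same as subtracting the corresponding hexadecimal digit from F, which is why the digit-by-digit hexadecimal shortcut gives the same answer as the bit-by-bit method.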

You will find that an understanding of hexadecimal and binary number systems is 
helpful to your study of architecture and assembly language programming, as well as 
other computer concepts. 
EXERCISES 


1.1 Describe the architecture of a bicycle. What parts of the bicycle are not part of the 
architecture? What are some implementation differences among different bicycles? 


1.2 Consider the car rental industry. What aspects of automobile architecture are essen- 
tial to the ability of any driver to operate a randomly allocated rental car? What are 
some implementation differences among different automobiles that are not especially 
relevant to either the driver or the rental agency? 


1.3 Explain why harpsichords, organs, and accordions are not exemplars of “piano archi- 
tecture.” 


1.4 Why do you think that one computer manufacturer builds computers with a different 
architecture from those of another manufacturer? What does this imply about the 
importance of well-standardized high-level languages? 


1.5 What effect do you think the standardization of operating systems or command lan- 
guages might have on the future development of computer architectures? 


1.6 Write a paragraph that discusses the architecture/implementation distinction in the 
light of the well-known success of Apple Computer, Inc. in making the operating 
system for Power Macintoshes (based on the IBM/Motorola PowerPC RISC chips) 
in such a way that it could run virtually all pre-existing Macintosh software applica- 
tions (containing exclusively instructions for the Motorola 680x0 series CISC chips). 


1.7 Figure out, from internal evidence in the columns of numbers in all three portions of 
Figure 1.2, which columns are octal, decimal, and hexadecimal. 


1.8 Adapt the program SQUARES to compute the cubes of the first five integers without 
using any explicit multiplication. Hint: An algorithm for $N^3$ can be discovered by 
writing down the series 1, 8, 27, 64, ... and then inspecting the pattern of first, 
second, and third tabular differences. (This is only a pencil-and-paper exercise unless 
you have access to the MACRO-64 assembler in the OpenVMS programming 
environment; if you do have such access and wish to execute your program, use the 
Control-Y technique to halt your program and then use the DCL examine command to 
print out the computed results.) 


1.9 (OpenVMS programming environment only.) How many hexadecimal digits are 
required for a quadword number? How many digits were shown by the examine 
command for the sample run of the SQUARES program on an Alpha? How can these 
observations be rationalized? Hint: Issue the OpenVMS DCL command help 
examine and think about what’s (not) shown. (This particular exercise is repeated 
at the end of Chapter 2 because you may later have additional perspective and be able 
to give a fuller answer.) 


1.10 Digital’s first highly successful minicomputer, the PDP-8, was a 12-bit machine. 
What range of integers can be represented in a 12-bit unsigned binary number? In a 
12-bit two’s complement signed number? 


1.11 Convert $10101010_{10}$ into hexadecimal and octal. Convert $\mathrm{10A34}_{16}$ into octal and 
decimal. Negate both values using 32-bit two’s-complement hexadecimal form. 


1.12 Complete the following table by converting the given number in each row into the 
other bases. 

Decimal   Binary   Octal   Hexadecimal 


1.13 Perform the following hexadecimal arithmetic: 


a. 205-6 
b. AF9 +9 
c. 10 * 10 (use “long” multiplication, and think carefully about the “carries”) 


d. 1CFF + F2FF 


1.14 Write a high-level language program which inputs a decimal number and a radix and 
reformats that decimal number in the specified radix. For example, if the inputs are 
10 and 16, the output should be A. Limit yourself to bases less than or equal to 16. 


CHAPTER 2 


Computer Structures 
and the Alpha 


When you develop applications in a high-level lan- 
guage such as C or Pascal, you have to understand the features, capabilities, and limita- 
tions of the programming language more thoroughly than you need to know the nature 
of the computer architecture on which that language has been implemented. In other 
words, the computer appears to be a machine that executes C or Pascal statements and 
manipulates high-level data elements for you. While standard Pascal provides for the 
storage and manipulation of Boolean variables, for example, the computer itself may 
not; in that case, the Pascal compiler will implement Boolean variables using another 
intrinsic data type which the computer architecture does support. 


The actual structure of the underlying hardware is thus virtually invisible to the 
high-level language programmer. Not so for the assembly language programmer, who 
gets a much closer feel for an actual machine and needs to understand in general the 
structure of a computer system, the nature of memory addressing, and the process of 
program execution. Once you have attained this foundation, you can begin to focus on 
learning instructions and elementary assembly language programming for a particular 
architecture. 


Computer Structures 


Nearly all general-purpose computers share the same overall structure, consisting of 
three essential components that are visible to the assembly language programmer: 
• the central processing unit (the CPU), which contains local storage registers, a 
  data manipulation unit that performs arithmetic and logic operations, and essential 
  control circuitry; 


• the high-speed primary memory, which holds both instructions and data; 


• the peripheral input and output devices (the I/O system), which enable interaction 
  with humans or with other devices. 


These components must be interconnected in some fashion, as indicated schematically 
in Figure 2.1. In a typical implementation of any architecture, the interconnections take 
the form of a bus, or several buses, to convey data and control signals. Just think of a 
bus as a band of parallel wires, with each wire dedicated to carrying one bit of data or 
one particular control signal. Greater width of bus structures tends to correlate with 
increases of both cost and performance. (The choice of the Intel 8088, which has a 16- 
bit CPU but only an 8-bit data bus, instead of the 8086, which has a 16-bit data bus and 
which was also available at the time, lessened the cost of the first IBM PC design but 
limited its performance as well.) 


Figure 2.1 Basic computer structure (diagram blocks include the Central Processing Unit and Input and Output) 





The Central Processing Unit 


The central processing unit (CPU) is the brain of the computer, and its activities are 
paced by a high-speed heartbeat (clocking signals). The CPU can fetch instructions and 
data from, and store data into, chosen memory cells. The CPU executes the fetched 
instructions, performing whatever arithmetic and logic operations are specified by those 
instructions. The CPU can manipulate addresses, as well as data. In addition, the CPU 
can redirect program flow along different paths according to the values of computed 
results. Finally, the CPU can execute instructions that initiate input and output opera- 
tions on peripheral devices. 
Internal to the CPU are a number of registers that provide local, high-speed 
storage for the processor. These registers can hold either data or memory addresses. Some 
computers have general-purpose registers that can be used for any function. Others have 
distinct sets of registers to perform different functions. The Alpha architecture specifies 
32 integer registers, R0 ... R31, that are used for addresses or integer data and 32 
floating-point registers, F0 ... F31. Such CPU registers and their width in bits are visible to 
the assembly language programmer (see Figure 1.2). Indeed, assembly language 
programming is distinguished from high-level language programming because it 
characteristically involves register-level programming, while the CPU registers are not directly 
accessible to a high-level language programmer. 

The program counter (PC) register always contains the address of the next instruc- 
tion to be fetched. The PC registers of the VAX and of the PDP-11 are visible to, and 
can be directly modified by, the assembly language programmer. In contrast, the pro- 
gram counter of the Alpha cannot be modified directly through the Alpha instruction 
set, except by branches and jumps. 


The Memory 


The memory of a computer serves as its main information repository. Almost from the 
beginning of the development of computer architectures, a single memory structure has 
generally been used to hold both instructions and data as proposed by John von Neu- 
mann. Over time, several types of physical components have been employed as memory 
elements. Nearly all of these—such as magnetic core and semiconductor circuit ele- 
ments (flip-flops and latches)—share the two-state property of a two-sided coin: each 
fundamental unit of memory storage can represent one binary digit, or bit. 

Bits in memory are arranged as an array of information units, with each informa- 
tion unit composed of a fixed number of bits, as shown in Figure 2.2. All memories 
share two organizational features: 


• Each information unit is the same size. 
• An information unit has a numbered address associated with it, by which it can be 
  uniquely referenced. 


At the architectural level, it is essential to distinguish between two aspects associated 
with each information unit: 


• its address, which is its relative position in the entire memory structure; and 
• its contents, which is whatever binary bit pattern may be physically stored at that 
  particular location in memory currently. 
The address of a particular information unit never changes, but its contents may be 
highly volatile. 


Figure 2.2 Memory organization (a memory of N units, each P bits wide from Bit P-1 down to Bit 0, with one address per unit) 


Another view of memory, as Figure 2.3 shows, is that of a large one-dimensional 
array, M(i), where each element of the array contains one unit of information. The 
index, i, is the address of a unit; using that address we can locate the contents of the par- 
ticular unit, M(i). 
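The array view maps directly onto a Python `bytearray`, which can serve as a small model of M(i). In this sketch the memory size of 16 units and the stored values are arbitrary choices made for illustration, and each unit is taken to be 8 bits wide.

```python
N = 16                     # number of information units (illustrative)
memory = bytearray(N)      # M(0) .. M(N-1); each unit is 8 bits wide in this model

address = 5                # the address (index) stays fixed throughout
memory[address] = 0x8C     # the contents of M(5) become 8C (hex)
first_read = memory[address]

memory[address] = 0x4E     # same address, new contents
second_read = memory[address]

print(f"M({address}) was {first_read:02X}, is now {second_read:02X}")
```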


Address 
(Index)    Contents 
0          M(0) 
1          M(1) 
...        ... 
i          M(i) 
...        ... 
N-1        M(N-1) 


Figure 2.3 Memory as an array 


In the past, different computer architectures specified various values for the 
word size, P, the number of bits for each information unit. A large word size was 
thought to facilitate scientific calculations. Then the first minicomputers specified 
word sizes as small as one byte (8 bits) for economy of design. Nearly all 
contemporary machine designs take another approach. At the architectural level, the memory is 
byte addressable. At the implementation level, the actual smallest amount of informa- 
tion stored or retrieved may be some multiple of a byte. On the PDP-11 and the VAX, 
the assembly language programmer has control down to the byte level. On the first 
implementations of the Alpha, the smallest amount of information that could be 
directly loaded into a register from memory, or stored from a register into memory, 
was one longword (4 bytes). Nevertheless, Alpha memory addresses are enumerated 
just as though the fundamental information unit were one byte in size. As we explain 
later in Chapter 6, the Alpha instruction set provides for access to the byte-sized sub- 
divisions of longwords and quadwords once these larger units have been brought 
from memory into CPU registers. 
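The idea of reaching a byte only after a larger unit is in a register can be modeled with shifts and masks. The Python sketch below is not Alpha code and does not reproduce any Alpha instruction; the function names are hypothetical, and little-endian byte order (lowest address holds the least significant byte) is an assumption made for the illustration.

```python
memory = bytearray(range(0x10, 0x20))   # 16 bytes holding 10, 11, ..., 1F (hex)


def load_quadword(memory, address):
    """Model a 64-bit load: fetch 8 consecutive bytes as one integer (little-endian assumed)."""
    return int.from_bytes(memory[address:address + 8], "little")


def extract_byte(register, byte_index):
    """Pick byte number byte_index (0 = least significant) out of a register value."""
    return (register >> (8 * byte_index)) & 0xFF


quad = load_quadword(memory, 8)     # brings the bytes at addresses 8..15 into a "register"
wanted = extract_byte(quad, 3)      # the byte that came from address 8 + 3 = 11
print(f"{quad:016X} {wanted:02X}")
```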

The address space is the set of all addresses, or the collection of all distinct 
information units that a program can reference. The number of bits used to represent 
an address, N, determines the size of the address space, $2^N$ (see Table 1.2). Unless a 
computer system design provides for special hardware elements for extended 
addressing, an address is usually less than or equal to the word size for a given archi- 
tecture. The era of the 16-bit minicomputers came to an end because programmers 
had difficulty accommodating large data structures or solving complex problems 
using only 64 kilobytes (65,536 unique memory locations). For a while, the 4- 
gigabyte address space of 32-bit architectures seemed ample for nearly all applica- 
tions. Digital’s “very large memory” (VLM64) servers appeal to Oracle, Sybase, and 
others for their highest performance database technologies because of the huge 16- 
exabyte address space of the Alpha architecture. 
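The three address-space sizes quoted above follow directly from the number of address bits. A quick Python check, assuming byte addressing and binary multiples (1 kilobyte taken as 1024 bytes, and so on), with `address_space_size` as an illustrative name:

```python
def address_space_size(address_bits):
    """Number of distinct addresses that address_bits bits can express."""
    return 2 ** address_bits


KILO, GIGA, EXA = 2 ** 10, 2 ** 30, 2 ** 60   # binary multiples

for bits, unit, name in [(16, KILO, "kilobytes"), (32, GIGA, "gigabytes"), (64, EXA, "exabytes")]:
    size = address_space_size(bits)
    print(f"{bits}-bit addresses: {size} bytes = {size // unit} {name}")
```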

As CPU speeds have increased faster than memory access speeds, system 
designers have had to introduce cache schemes in order to produce high-performance 
implementations of contemporary computer architectures. A cache subsystem is 
made from faster, but more expensive, memory elements; it holds copies of instruc- 
tions or data from frequently accessed memory locations. In some architectures, a 
cache memory is an implementation feature that is guaranteed to be invisible to an 
assembly language programmer (except indirectly through the perception of 
improved throughput). In other architectures, such as the PowerPC, the cache mem- 
ory is an architectural element that the system programmer can control using the 
machine’s instruction set. 


The Input/Output System 


Various types of input/output (I/O) devices allow the processor to communicate with 
humans, with other processors, and with secondary storage devices that are slower than 
main memory but larger in capacity. These devices external to the central processor 
include terminals, printers, network interfaces, magnetic disks, and magnetic tapes. 
Such devices are highly diverse in speed, in the amount of data transferred in a single 
operation, and in the degree of processor intervention required. Some operating systems 
attempt to bring most peripheral devices within a standardized protocol for setting up 
data transfers (e.g., RT-11 for the PDP-11), while others may treat all terminals and 
printers in one way, all disks in another way, etc. (e.g., OpenVMS for VAX and Alpha). 
These distinctions about external devices, which lie outside the scope of this book, are 
treated in books on operating systems. 


Commonly, all but the simplest devices in the simplest computers have the capa- 
bility of moving data directly to and from memory after the processor has specified 
overall parameters such as a starting memory location and a total amount of data to be 
moved. Special hardware devices called direct memory access (DMA) controllers allow 
such transfers to proceed in a time-overlapped fashion with other computing activity 
which does not require or modify the particular data being moved. The components of 
system software directly concerned with I/O devices are often called device drivers. 
Again, this book does not attempt to discuss device drivers in any detail, although we 
may note in passing that these are the operating system modules that are the most likely 
to be written in assembly language or in a language like C which facilitates tightly con- 
trolled programming techniques. 


In Alpha systems, most I/O devices are connected in the first instance to a high- 
performance bus (such as PCI), and that bus is connected through a bridge controller to 
the memory-CPU bus. Ultimately, the I/O bus address space is made available to the 
processor and generally appears mapped into the virtual address space in a manner that 
does not conflict with the addressing of physical memory. An operating system can then 
use device drivers that communicate with peripheral devices using load and store 
instructions from and to those special addresses. 


Instruction Execution 


A computer is controlled by a program, which is composed of sequences of instruc- 
tions. At the assembly language level, each instruction specifies one fundamental 
operation to be performed by the CPU. Instructions can be classified into five basic 
categories: 
1. Data movement instructions move data from one location in memory to another, between 
   memory and processor registers, among processor registers, or between memory and I/O 
   devices. 

2. Arithmetic instructions perform various mathematical operations, for example, the 
   addition of one quantity to another. 

3. Logical instructions perform Boolean operations such as AND, OR, and EXCLUSIVE OR. 

4. Comparative instructions examine a quantity, or compare two quantities, as part of setting 
   up decisions for conditional program flow. 

5. Control instructions change the dynamic execution of a sequence of instructions by 
   causing program execution to proceed next to a specified instruction, either unconditionally or 
   based on the results of a comparative, logical, or arithmetic instruction. 


These conceptual categories do not necessarily map one-to-one onto an actual machine 
architecture. For instance, the Alpha architecture uses two very different instruction 
types for data movement, one type to move data between memory and registers and 
another type to move data among registers. 

Instructions are stored in memory, as are the data on which they operate. Depend- 
ing on the architecture (see Table 1.2 for examples), an instruction in memory can 
occupy several consecutive words or bytes. Each architecture has its own representation 
of instructions, but every instruction is typically composed of two basic components: 


• an operation code (opcode), which specifies the function to be performed, e.g., 
  add, subtract, compare, jump; 

• one or more operand specifiers, which describe the locations of the information 
  units that must be accessed when the operation is performed. These information 
  units (more precisely, their current contents) are the operands of the instruction. 


The processor fetches an instruction from memory, interprets the operation code, and 
executes the appropriate function on its operands. 

The location (memory address) of the next instruction to be executed is always 
held in the special internal CPU register called the program counter (PC). When a pro- 
gram is first loaded into memory, the PC is initialized with the address of the first 
instruction of the program. Using the PC, the CPU fetches this instruction along with 
any operand specifiers. The operands themselves are then fetched, the operation is per- 
formed, and any results generated are stored back in memory or in a register. Figure 2.4 
shows the complete cycle of fetching, interpreting, and executing an instruction. 
1. Fetch instruction from the location pointed to by PC and increment PC 
2. Decode instruction (determine instruction type, e.g., ADD) 
3. Fetch operands specified by instruction 
4. Perform specified operation 
5. Store any results 


Figure 2.4 Instruction execution 


Each time that an instruction is fetched, the PC is modified to point to the next 
instruction. In this way, the processor can readily execute instructions one after another 
in sequence. A few instruction types produce skip, branch, or jump effects; these are the 
machine language equivalents of GOTO in a high-level language. When one of these 
instruction types is executed, it changes the contents of the PC in such a way that a new 
location, no longer the next one in sequence, will be taken as the information unit con- 
taining the “next” instruction to be executed. 


A complete computer program will consist of both instructions and data, with 
each instruction or data element loaded at a unique address in memory when the pro- 
gram runs. If we were to inspect arbitrary memory units, we would have great difficulty 
differentiating data from instructions because both are merely numbers coded in binary. 
Moreover, there may well be numerous sorts of data elements, such as integers, float- 
ing-point numbers, or character strings. How does the computer “know” that some of 
these numbers are machine instructions, while others are data? 


When a program is produced using a compiler or assembler and other system soft- 
ware, its starting address is made part of the stored form of the program. The loading 
process sets that address, which could really be anywhere in memory, into the PC just 
before the program starts to run. Once this single master “pointer” is known, everything 
has a context: instructions come along in sequence (unless there are changes of direc- 
tion), and the operand specifiers within each instruction guide the processor towards 
finding and appropriately treating each of the data locations. 
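The role of the PC as the single master pointer can be seen in a toy simulation. The machine below is entirely hypothetical: it has one working register, and each instruction is stored as an ordinary integer equal to opcode × 100 + address, with opcodes invented for this sketch. Instructions and data share one memory, so a stored 0 is a HALT instruction if the PC reaches it and plain data otherwise.

```python
HALT, LOAD, ADD, STORE, JUMP = 0, 1, 2, 3, 4   # invented opcodes


def run(memory, start_address):
    """Fetch, decode, and execute until HALT; return the working register."""
    pc = start_address
    register = 0
    while True:
        word = memory[pc]                   # fetch from the location the PC points to
        pc += 1                             # ...and advance the PC to the next word
        opcode, address = divmod(word, 100) # decode
        if opcode == LOAD:
            register = memory[address]
        elif opcode == ADD:
            register += memory[address]
        elif opcode == STORE:
            memory[address] = register
        elif opcode == JUMP:
            pc = address                    # a jump simply overwrites the PC
        elif opcode == HALT:
            return register
        else:
            raise ValueError(f"unknown opcode {opcode}")


memory = [
    7,    # address 0: data
    5,    # address 1: data
    0,    # address 2: data (result goes here)
    100,  # address 3: LOAD 0   <- the program starts here, not at address 0
    406,  # address 4: JUMP 6
    201,  # address 5: ADD 1    (skipped by the jump)
    201,  # address 6: ADD 1
    302,  # address 7: STORE 2
    0,    # address 8: HALT
]
result = run(memory, start_address=3)
print(result, memory[2])
```

Nothing in the memory itself marks address 3 as the start of the program; only the start address handed to `run` gives the stored numbers their meaning.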


Classes of Instruction Architectures 


Computer instruction sets vary greatly from machine to machine. We can usefully clas- 
sify instruction sets into several categories based on the number of addresses contained 
within the typical instruction. 


Perhaps simplest would be a computer with a zero-address, or stack-based, 
instruction set. Part of the processor state for such a machine includes a stack pointer to 
the top of a last-in first-out (LIFO) list, or stack, which may reside either in memory or 
in special storage within the processor. The stack pointer is an implicit operand for most 
instructions; it is not specified explicitly in the instruction word. On such a machine, 
the instructions 


PUSH A 
POP B 


cause the contents of memory location A to be placed on the top of the stack (as a result 
of the PUSH); the contents of the stack are then removed and copied into memory loca- 
tion B (as a result of the POP). Arithmetic instructions always operate on the top one or 
two elements of the stack, depending on whether they are unary or binary operations. 
For example, the instruction 


ADD 


would add the top two elements of the stack together, popping those elements from the 
stack and then pushing the resultant sum onto the top of the stack. Such machines are 
called zero-address machines because arithmetic operations such as ADD have no 
explicit operands; the operands are implicitly known to be the top two stack elements. 
Some hand-held calculators behave like zero-address computing devices. 
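A zero-address machine of this kind is easy to mimic with a Python list used as the stack. The sketch below is an illustration only; the interpreter name `run_stack_program` and the use of a dictionary for named memory locations are our own simplifications.

```python
def run_stack_program(program, memory):
    """Execute PUSH/POP/ADD instructions; return the stack left at the end."""
    stack = []
    for instruction in program:
        opcode, *operand = instruction.split()
        if opcode == "PUSH":
            stack.append(memory[operand[0]])    # copy memory contents onto the stack
        elif opcode == "POP":
            memory[operand[0]] = stack.pop()    # remove the top element into memory
        elif opcode == "ADD":
            right = stack.pop()                 # operands are implicit:
            left = stack.pop()                  # the top two stack elements
            stack.append(left + right)
        else:
            raise ValueError(f"unknown instruction {instruction!r}")
    return stack


memory = {"A": 2, "B": 3, "C": 0}
leftover = run_stack_program(["PUSH A", "PUSH B", "ADD", "POP C"], memory)
print(memory["C"], leftover)
```

Note that ADD names no operands at all, while PUSH and POP each name one memory location.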

A computer with a one-address instruction set typically has a single register, usu- 
ally called the accumulator. Just as the stack is an implied operand for instructions on a 
stack machine, so too this single accumulator is the implied operand for instructions on 
a one-address machine. We could add the contents of memory locations A and B, putting 
the result in location C, on a one-address machine using a program fragment of the 
following sort: 


LOAD A 
ADD B 
STORE C 


The LOAD instruction copies the contents of memory location A into the accumulator. 
Then the ADD instruction finds and adds the contents of location B to the accumulator. 
Finally, the STORE instruction copies the contents of the accumulator into memory 
location C. In this architecture, it is not possible to add a number directly to the contents 
of a memory location; the summation must be formed in a processor register. The PDP- 
8 used one-address instructions. 
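The three-instruction fragment can be traced with a small Python model in which the accumulator is a single variable. This is a sketch of the general one-address idea, not of the PDP-8 instruction set; the names used are hypothetical.

```python
def run_accumulator_program(program, memory):
    """Execute (opcode, location) pairs against a single implied accumulator."""
    accumulator = 0
    for opcode, location in program:
        if opcode == "LOAD":
            accumulator = memory[location]        # memory -> accumulator
        elif opcode == "ADD":
            accumulator += memory[location]       # sum is formed in the accumulator
        elif opcode == "STORE":
            memory[location] = accumulator        # accumulator -> memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return accumulator


memory = {"A": 2, "B": 3, "C": 0}
final = run_accumulator_program([("LOAD", "A"), ("ADD", "B"), ("STORE", "C")], memory)
print(final, memory)
```

Each instruction names exactly one memory location; the accumulator never appears in the program text because it is implied.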

A two-address instruction set allows two operands to be specified in the instruc- 
tion. For example, 


ADD A, B 


adds the contents of location A to the contents of location B. When the instruction is 
completed, the contents of location A are unchanged while location B now contains the 
sum. (In some architectures the semantic order of the operands would be reversed, with 
a result of A := A + B instead of B := A + B. We have cited the operand semantics 
adopted in the architectures designed by Digital Equipment Corporation.) 

Finally, a three-address instruction set allows two source operands and one desti- 
nation operand to be specified in the instruction. For example, 


ADD A, B, C 


adds the contents of location A to the contents of location B and stores the result of the 
summation in a third location. When the instruction is completed, the contents of loca- 
tions A and B are unchanged while location C now contains the sum. (Again, we have 
illustrated the semantic choice adopted by Digital Equipment Corporation.) 
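The difference between the two- and three-address forms comes down to which location receives the sum. A minimal Python sketch, using the Digital operand order described above and function names invented for the illustration:

```python
def add_two_address(memory, a, b):
    """ADD A, B  ->  B := A + B (the second operand is both source and destination)."""
    memory[b] = memory[a] + memory[b]


def add_three_address(memory, a, b, c):
    """ADD A, B, C  ->  C := A + B (both sources are left unchanged)."""
    memory[c] = memory[a] + memory[b]


two = {"A": 2, "B": 3}
add_two_address(two, "A", "B")

three = {"A": 2, "B": 3, "C": 0}
add_three_address(three, "A", "B", "C")

print(two, three)
```

In the two-address case the original contents of B are lost, so a program that still needs them must copy them elsewhere first; the three-address form avoids that extra step.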

These are just some of the options available to the designer of an instruction set, 
and several may be included in a given machine architecture. For example, the VAX 
provides for both the two- and the three-address ADD instructions just illustrated, while 
the PDP-11 only provided for the two-address version. In general, the more operands an 
instruction set allows, the more powerful the instructions tend to become. Yet as the 
instruction set becomes more powerful in this way, the hardware required to implement 
the complete instruction set also becomes more complex. 
The Alpha is a two-address machine with respect to load and store operations, 
which are the only instruction types that may reference the contents of memory loca- 
tions. In such instructions, one address refers to a memory location and the other to a 
processor register. The Alpha is a three-address machine with respect to integer, logical, 
and floating-point operations whose operands are all contained in processor registers; 
the three registers to be involved are specified within the instruction. Other details about 
Alpha instructions will be taken up in Chapter 4. 


Introduction to the Alpha 


We begin a description of the Alpha architecture with a brief comparative glimpse back 
upon its precursors in the superfamily outlined in Chapter 1 (Table 1.2). 

The PDP-11 was a best-selling range of 16-bit minicomputers. The PDP name 
had been chosen earlier by Digital Equipment Corporation for one of its first comput- 
ers that was to be marketed for applications in scientific laboratories as a Pro- 
grammed Data Processor. Other PDP architectures had word sizes of 12, 18, and 36 
bits. The VAX architecture was designed to extend the addressing capabilities of its 
16-bit PDP-11 predecessor to 32 bits, hence the name VAX for Virtual Address 
eXtension. Both the PDP-11 and VAX are said to be memory-memory architectures; 
that is, many data manipulation instructions can refer directly to memory locations as 
both source and destination for operands. This imparts a sense of tremendous 
versatility to the instruction sets of these machines. The PDP-11 and VAX also have 
numerous register-register instructions. 

The Alpha is a 64-bit load/store RISC architecture designed to be able to take 
advantage of technological improvements over a projected lifetime of more than two 
decades through implementations that differ in chip design, clock speed, multiple 
instruction issue (i.e., ability of one CPU to begin working on more than one instruction 
at once), and multiple processors. All registers are 64 bits in width, and all operations 
on data are performed in those registers. Therefore, the Alpha can also be described as a 
register-register architecture. Separating memory access (load/store) from data manip- 
ulations in a RISC architecture brings performance gains in those implementations 
which exploit pipelining, instruction scheduling, and parallel operational units. 

The basic Alpha instruction set is quite lean. Each instruction fits compactly into 
one longword (32 bits), in keeping with RISC design principles. Implementations with 
multiple issue, called superscalar machines, can fetch 2, 4, ... instructions from mem- 
ory in order to achieve high performance. Indeed, even the first Alpha implementations 
had dual processing paths inside the CPU to permit simultaneous processing of two 
instructions, provided one instruction involved integer operands and the other involved 
floating-point operands. Such advantages come at the cost of larger program size, since 
several Alpha instructions may be required to perform the equivalent of one VAX 
instruction (on average, perhaps twice as many). 

Some common operations like mov that we used in the SQUARES program (Fig- 
ure 1.2) are actually special cases of versatile instructions. The assembler provides 
pseudo-instructions like mov for supplementary convenience. The Alpha also comes 
with a Privileged Architecture Library (PALcode), a set of subroutines that are specific 
to a particular Alpha operating system implementation. These manufacturer-supplied 
routines are themselves written in standard machine code, but may execute with hard- 
ware interrupts suppressed. The concept behind PALcode is similar to the motivation 
for the BIOS libraries in a personal computer that bridge between implementation-spe- 
cific hardware and more generic system software. Chapter 13 contains a brief introduc- 
tion to PALcode. 

Virtual addresses for the Alpha architecture are potentially 64 bits long, though 
the architectural specification permits the virtual address space of an implementation to 
be 43 bits minimally. Since address bits are used to specify 8-bit bytes, the Alpha is a 
byte-addressable system. Nevertheless, memory access actually occurs in multiples of 
several bytes, preferably quadwords (8 bytes); certain bus transfers of data may even 
occur as octawords (16 bytes). Furthermore, performance is hindered by a penalty 
approaching 100-fold unless memory accesses can proceed uniformly with quadword- 
aligned data. If the lowest 3 bits of an address are zero, that address is said to be quad- 
word-aligned; if only the lowest 2 bits are zero, the address is said to be longword- 
aligned. These important distinctions will come up again later. 
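
The alignment test amounts to masking the low-order address bits. The following Python sketch illustrates it; the function names and the sample addresses are assumptions made for this illustration only.

```python
def is_longword_aligned(address):
    """True when the lowest 2 bits of the address are zero."""
    return (address & 0b11) == 0


def is_quadword_aligned(address):
    """True when the lowest 3 bits of the address are zero."""
    return (address & 0b111) == 0


# Hypothetical sample addresses.
for address in (0x1000, 0x1004, 0x1006):
    print(hex(address), is_longword_aligned(address), is_quadword_aligned(address))
```

Every quadword-aligned address is also longword-aligned, but not the reverse: 1004 (hex) passes the first test and fails the second.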


Alpha Information Units and Data Types 


The basic information unit on the Alpha is the 8-bit byte. Individual bytes are given 64- 
bit addresses, but it is also important to understand that groups of adjacent bytes have 
addresses, as shown in Figure 2.5. These multi-byte units, which were mentioned in 
Chapter 1, include the 16-bit word (2 bytes), the 32-bit longword (4 bytes), and the 64- 
bit quadword (8 bytes). Such units are always addressed by the low-order byte of the 
group. Similarly, the addresses of the higher-order bytes within the larger information 
units take on the successive values beyond the address of the lowest-order byte. 

[Figure 2.5 shows the four information units with their bit numbering: byte (bits 7 to 0), 
word (bits 15 to 0), longword (bits 31 to 0), and quadword (bits 63 to 0).] 

Figure 2.5 Alpha information units 


In the convention that has always been used by Digital Equipment Corporation, 
the individual bits within any information unit are numbered from the least significant 
bit, bit 0, on the right. The most significant bit is then bit 7 for a byte, bit 15 for a word, 
bit 31 for a longword, and bit 63 for a quadword. Some other machine designers have 
adopted the opposite numbering convention by naming the most significant bit on the 
left as bit 0. Note that the convention for the Alpha does have the convenience of corre- 
sponding directly to the positional weighting scheme for evaluating binary values that 
was presented in Chapter 1. That is, the weight of bit i is 2^i. 


For the Alpha the corresponding convention for ordering the bytes within words, 
longwords, and quadwords is to store the lowest order byte of the group at the lowest 
address. This is the little-endian convention, which also applies to Intel chips. The 
opposite convention, where the highest order byte of a group is stored at the lowest 
address, is called big-endian, which is followed by the Motorola 680x0 and the Motor- 
ola-IBM PowerPC chips. When character string data are transmitted between systems, 
the bytes travel in the same order as letters in words and words in sentences in Western 
languages. But when little-endian and big-endian systems attempt to break up, say, a 
32-bit binary number into four 8-bit binary bytes for sequential transmission, what one 
system views as WXYZ will be perceived by the other as ZYXW. This problem affects 
only the byte ordering; all systems agree on the ordering (but not the numbering!) of the 
bits within bytes. 
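
The WXYZ versus ZYXW effect can be reproduced in a few lines of Python. This is a sketch for illustration; the 32-bit value is chosen here so that its four bytes happen to be the ASCII codes for W, X, Y, and Z.

```python
value = 0x5758595A   # bytes 'W', 'X', 'Y', 'Z' from most to least significant

big = value.to_bytes(4, "big")         # highest-order byte at the lowest address
little = value.to_bytes(4, "little")   # lowest-order byte at the lowest address

print(big)      # b'WXYZ'
print(little)   # b'ZYXW'

# The receiver recovers the number only if it assumes the sender's byte order.
print(int.from_bytes(little, "little") == value)   # True
print(int.from_bytes(little, "big") == value)      # False
```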

Let us consider, as a specific example of data storage, that the quadword quantity 
0F0E0D0C0B0A0908 (base 16) is stored at address Q. Location Q is then also the address of 
the longword whose value is 0B0A0908 (base 16), the word whose value is 0908 (base 16), and the 
byte whose value is 08 (base 16). In similar fashion, location Q+1 is the address of the byte 
whose value is 09 (base 16), the word whose value is 0A09 (base 16), and so forth. The Alpha uses 
separate opcodes, ldl or ldq and stl or stq, in order to specify whether a longword or 
quadword type of information unit is to be loaded or stored in a data transfer between 
memory and a register. 
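
The overlapping addresses in this example can be checked with a short Python sketch. The byte string memory stands in for the eight bytes starting at Q, and the helper read is a hypothetical name introduced only for this illustration.

```python
import struct

quadword = 0x0F0E0D0C0B0A0908
memory = struct.pack("<Q", quadword)   # little-endian image; index 0 plays the role of Q


def read(offset, size):
    """Read an unsigned little-endian unit of `size` bytes at address Q + offset."""
    return int.from_bytes(memory[offset:offset + size], "little")


print(f"{read(0, 8):016X}")   # quadword at Q:   0F0E0D0C0B0A0908
print(f"{read(0, 4):08X}")    # longword at Q:   0B0A0908
print(f"{read(0, 2):04X}")    # word at Q:       0908
print(f"{read(0, 1):02X}")    # byte at Q:       08
print(f"{read(1, 1):02X}")    # byte at Q+1:     09
print(f"{read(1, 2):04X}")    # word at Q+1:     0A09
```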


Alpha integer registers R0 ... R31 are 64 bits wide and can thus accommodate any 
of these four information units. Bytes can be modified within a register through certain 
instructions designed for that purpose, but cannot be transferred directly between regis- 
ters and memory. (An extension of the originally defined architecture introduced a more 
direct way to access byte-width data for later Alpha implementations.) 


What interpretations can be placed on the bit patterns stored in these information 
units? The fundamental data types supported by the instruction set of the Alpha archi- 
tecture are integers and floating-point numbers. In addition, a compiler program or an 
assembly language programmer can impart further purposes to integers, for example, to 
represent characters or Boolean variables. 


Integers 


We reviewed the concepts of binary representation of integers in Chapter 1. A span of N 
bits can be used in one of two ways: to represent a range of unsigned integers, 0 to 2^N - 1, 
or to represent a range of signed integers, -2^(N-1) through 0 to +2^(N-1) - 1. Table 2.1 
shows the numeric ranges for the various integer sizes that are pertinent to the Alpha. 


Table 2.1 Integer Data Types 

                       Numeric Range (expressed in decimal radix) 

Type      Bits  Bytes  Signed                              Unsigned 
Byte         8      1  -128 to +127                        0 to 255 
Word        16      2  -32,768 to +32,767                  0 to 65,535 
Longword    32      4  -2,147,483,648 to                   0 to 4,294,967,295 
                       +2,147,483,647 
Quadword    64      8  -9,223,372,036,854,775,808 to       0 to 
                       +9,223,372,036,854,775,807          18,446,744,073,709,551,615 
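
The limits for each integer size follow directly from the formulas for N bits. The following Python sketch computes them; the function names are assumptions made for this illustration.

```python
def unsigned_range(bits):
    """Unsigned integers in N bits run from 0 to 2^N - 1."""
    return 0, 2**bits - 1


def signed_range(bits):
    """Signed integers in N bits run from -2^(N-1) to +2^(N-1) - 1."""
    return -(2**(bits - 1)), 2**(bits - 1) - 1


for name, bits in (("Byte", 8), ("Word", 16), ("Longword", 32), ("Quadword", 64)):
    low, high = signed_range(bits)
    top = unsigned_range(bits)[1]
    print(f"{name:9} signed {low:,} to +{high:,}   unsigned 0 to {top:,}")
```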


The Alpha has arithmetic instructions for both quadword and longword integers. 
Byte-manipulation instructions facilitate packing and unpacking smaller information 
units within quadwords. The basic Alpha instruction set provides no direct support for 
word-length integer data. Logical instructions work only with quadword-length data; 
these provide some capability for access to data packed at the bit or group-of-bits level. 


Floating-Point Numbers 


Since integers may lack the dynamic range necessary for certain scientific applications, 
most computer architectures provide for floating-point numbers also. Floating-point 
numbers in a computer correspond to scientific notation. Whereas hand-held calculators 
display numbers as a decimal number (with a fractional part) that is multiplied by some 
power of 10, computers typically represent non-integer data as a binary fraction that is 
multiplied by some power of 2. The fraction, exponent, and sign of a number can be bit- 
packed into an information unit in several ways. 


The PDP-11 architecture supported a 4-byte representation of floating-point num- 
bers which was later called F_floating in the VAX architecture. The VAX introduced 
several additional representations: D_floating (8 bytes with 32 more bits in the fraction 
than F_floating), G_floating (8 bytes with 3 more bits allocated to the exponent and 3 
fewer to the fraction as compared to D_floating), and H_floating (16 bytes). The 
G_floating and H_floating representations were more comparable to those adopted by 
certain other computer manufacturers. 


After the VAX appeared but before the Alpha was designed, the computer engi- 
neering community established a set of industry standards for floating-point numbers 
called ANSI/IEEE 754-1985. Of the several standard IEEE representations, typical 
Alpha implementations support the basic IEEE single (S_floating) and double 
(T_floating) formats whose characteristics are given in Table 2.2. The Alpha also sup- 
ports VAX F_floating and G_floating formats (with slight approximations in the frac- 
tions) as well as VAX D_floating format in a very limited way, but not H_floating 
format at all. In this book we omit the proprietary VAX-compatible representations; 
instead we describe and use only the new IEEE formats. 


Table 2.2 IEEE Floating-point Numbers in the Alpha 

                                   S_floating          T_floating 

Size of representation in memory   4 bytes             8 bytes 
Sign                               1 bit               1 bit 
Exponent                           8 bits              11 bits 
Fraction                           23 bits             52 bits 
Exponent bias                      127                 1023 
Minimum magnitude                  1.175 x 10^-38      2.225 x 10^-308 
Maximum magnitude                  3.403 x 10^+38      1.798 x 10^+308 
Precision                          6 decimal digits    15 decimal digits 


The IEEE representations not only facilitate direct interchange of data between 
computer systems with different architectures, but also provide for special values that 
could not be represented in the VAX formats. For example, special bit patterns are used 
to represent positive infinity and negative infinity. These obey standard algebraic rules, 
such as to ensure that positive infinity plus a valid finite number yields positive infinity 
as the sum. Other special bit patterns are called NaN, not a number. These can be used 
when a computed result is algebraically indeterminate, such as infinity minus infinity. 
As this book does not emphasize the numerical analysis of algorithms, we will have lit- 
tle more to say about such special values. 
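
These special values can be observed from any language whose floating-point type follows the IEEE standard. The sketch below uses Python, whose float type is an IEEE double on common platforms (an assumption about the host system, not about the Alpha).

```python
import math

positive_infinity = float("inf")

# Infinity plus a valid finite number is still infinity.
print(positive_infinity + 1.0e300)              # inf

# Infinity minus infinity is algebraically indeterminate, so the result is a NaN.
indeterminate = positive_infinity - positive_infinity
print(math.isnan(indeterminate))                # True

# A NaN compares unequal to everything, including itself.
print(indeterminate == indeterminate)           # False
```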


An IEEE T_floating datum (double precision) occupies 8 adjacent bytes in mem- 
ory starting on an arbitrary byte boundary, though an Alpha system will experience 
large performance penalties unless T_floating data are naturally aligned (i.e., quadword 
aligned). The bits are labeled from right to left, 0 through 63, as follows: 


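[Diagram: T_floating datum in memory, drawn as 16-bit words with bit labels 15, 14, 4, 3, 0. 
In the highest-addressed word, bit 15 is the sign, bits <14:4> are the exponent, and 
bits <3:0> are the high-order fraction bits; the lower-addressed words hold the rest of 
the fraction.] 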

When these portions of the datum are brought into a floating register, their arrangement 
is as follows: 


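[Diagram: T_floating datum in a floating register. Bit 63: sign; bits <62:52>: exponent; 
bits <51:48>, <47:32>, <31:16>, and <15:0>: fraction, from high order to low order.] 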


Bit 63 is the sign bit, bits <62:52> represent the exponent of 2 biased by addition of 
1023 to the true value, and bits <51:0> represent a 52-bit fraction. The actual fraction is 
adjusted so that the leading bit would be 1, then shifted by one more position to the left 
so that this logically known bit does not have to be represented physically. The 
precision of the fraction is thus one part in 2^53. Except for special cases, the value of the 
number is 


(1 - 2 x S) x 1.F x 2^(E - 1023) 


If all the bits in the representation are zero, the number represented is zero by 
convention. 
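
The field layout and the value formula can be tried out in Python, whose float type uses the same IEEE double format on common platforms. This is a sketch under that assumption; the function names are hypothetical, and the formula is applied only to an ordinary normalized number.

```python
import struct


def decode_t_floating(value):
    """Split a double into its sign (S), biased exponent (E), and fraction (F) fields."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits <62:52>
    fraction = bits & ((1 << 52) - 1)    # bits <51:0>
    return sign, exponent, fraction


def t_floating_value(sign, exponent, fraction):
    """(1 - 2 x S) x 1.F x 2^(E - 1023), valid for normalized numbers only."""
    return (1 - 2 * sign) * (1 + fraction / 2**52) * 2.0 ** (exponent - 1023)


s, e, f = decode_t_floating(-6.5)
print(s, e, hex(f))                # 1 1025 0xa000000000000
print(t_floating_value(s, e, f))   # -6.5
```

Here -6.5 is -1.101 (base 2) x 2^2, so the biased exponent is 1023 + 2 = 1025 and the stored fraction begins with the bits 101.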

An IEEE S_floating datum (single precision) occupies 4 adjacent bytes in mem- 
ory starting on an arbitrary byte boundary, though an Alpha system will experience 
large performance penalties unless S_floating data are naturally aligned (i.e., longword 
aligned). The bits are labeled from right to left, 0 through 31, as follows: 


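[Diagram: S_floating datum in memory, drawn as two 16-bit words with bit labels 15, 14, 7, 6, 0. 
The lower-addressed word holds the low-order fraction bits; in the higher-addressed word, 
bit 15 is the sign, bits <14:7> are the exponent, and bits <6:0> are the high-order 
fraction bits.] 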


When these portions of the datum are brought into a floating register, their arrangement 
is as follows: 


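[Diagram: S_floating datum in a floating register. Bit 63: sign; bits <62:52>: exponent; 
bits <51:45> and <44:29>: fraction; bits <28:0>: zero.] 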


That is, the S_floating load instruction left-justifies the number into the 64-bit register. 
At the same time, the exponent is expanded from 8 to 11 bits and the additional extreme 
low-order bits of the fraction are extended as zeros. In the register this transformation 
results in the equivalent of a T_floating number that is suitable for either S_floating or 
T_floating arithmetic operations. When the S_floating number is stored in memory, bit 
31 is the sign bit, bits <30:23> represent the exponent of 2 biased by addition of 127 to 
the true value, and bits <22:0> represent a 23-bit fraction. The actual fraction is 
adjusted so that the leading bit would be 1, then shifted by one more position to the left 
so that this logically known bit does not have to be represented physically. The 
precision of the fraction is thus one part in 2^24. Except for special cases, the value of the 
number is 


(1 - 2 x S) x 1.F x 2^(E - 127) 

If all the bits in the representation are zero, the number represented is zero by 
convention. 
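
The widening that takes place when an S_floating number is loaded can be modeled in Python by encoding the same value first as an IEEE single and then as an IEEE double. This is a sketch, not a description of the hardware: it assumes an ordinary normalized value, for which the register image coincides with the T_floating encoding of the same number, and the function names are hypothetical.

```python
import struct


def s_memory_image(value):
    """32-bit S_floating pattern as it would be stored in memory."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def s_register_image(value):
    """64-bit pattern after widening: the same value in T_floating layout."""
    single = struct.unpack("<f", struct.pack("<f", value))[0]   # round to single precision
    return struct.unpack("<Q", struct.pack("<d", single))[0]


memory_bits = s_memory_image(1.5)
register_bits = s_register_image(1.5)

print(f"{memory_bits:08X}")              # 3FC00000
print(f"{register_bits:016X}")           # 3FF8000000000000
print(register_bits & ((1 << 29) - 1))   # 0, since bits <28:0> are zero
```

The exponent field grows from 8 bits (biased by 127) to 11 bits (biased by 1023), and the 23 fraction bits end up left-justified in bits <51:29>.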

As one example, the S_floating representation for the decimal number 4.25 as it 
would be stored in memory can be constructed using the following steps: 

4.25 (base 10) = 2^2 + 2^-2 = 100.01 (base 2)     (convert from base 10 to base 2) 

               = 1.0001 (base 2) x 2^2            (shift into normalized form) 

After the “hidden bit” is suppressed, the binary fraction is F = 
00010000000000000000000 (23 bits in all). The true exponent is 2, but with the bias of 
127 (base 10) this becomes 129 (base 10) or E = 10000001 (base 2). The sign is S = 0 since 
the number is positive. Putting all those pieces together in the order S-E-F, we have 


0 10000001 00010000000000000000000 
S E        F 


By reclustering 4 bits at a time, we can deduce what this would look like if it were to be 
printed as an unsigned hexadecimal number: 


0100 0000 1000 1000 0000 0000 0000 0000 (base 2) = 40880000 (base 16) 


Note that only a number whose fractional part can be represented exactly as a sum of 
inverse powers of 2 can be stored exactly. Common decimal fractions like 0.1 or 0.7 
cannot be stored exactly. 
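
Both points can be confirmed with a short Python sketch. It assembles the pattern for 4.25 from the S, E, and F fields, compares it with the encoding produced by the standard struct module, and then shows that 0.1 does not survive storage unchanged. The variable names are assumptions made for this illustration.

```python
import struct
from fractions import Fraction

# Assemble the pattern for 4.25 from its fields: S = 0, E = 2 + 127, F = .0001 (base 2).
sign = 0
exponent = 2 + 127
fraction = 1 << 19            # the fourth of the 23 fraction bits (bit 19) is set
pattern = (sign << 31) | (exponent << 23) | fraction
print(f"{pattern:08X}")       # 40880000

# The same pattern as produced by the library's IEEE single-precision encoder.
library_pattern = struct.unpack(">I", struct.pack(">f", 4.25))[0]
print(pattern == library_pattern)   # True

# 0.1 is not a finite sum of inverse powers of 2, so the stored value differs from 1/10.
stored = struct.unpack(">f", struct.pack(">f", 0.1))[0]
print(Fraction(stored) == Fraction(1, 10))   # False
```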

As another example, the 32-bit pattern 41260000 (base 16) for an S_floating number 
stored in memory can be interpreted by reversing the steps just illustrated: 


41260000 (base 16) = 0100 0001 0010 0110 0000 0000 0000 0000 (base 2) 
                   = 0 10000010 01001100000000000000000 
                     S E        F 
                   = +1.010011 (base 2) x 2^(130-127) = 1.010011 (base 2) x 2^3 = 1010.011 (base 2) 
                   = +(8 + 2 + 0.25 + 0.125) (base 10) = 10.375 (base 10) 


Conversions for T_floating numbers would proceed in a similar fashion. 
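
The reverse conversion can also be written as a few lines of Python. This sketch applies the value formula for normalized S_floating numbers to a 32-bit pattern and cross-checks the result against the standard struct module; the function name is hypothetical.

```python
import struct


def decode_s_floating(pattern):
    """Return (S, E, F, value) for a normalized 32-bit S_floating pattern."""
    sign = pattern >> 31                 # bit 31
    exponent = (pattern >> 23) & 0xFF    # bits <30:23>
    fraction = pattern & 0x7FFFFF        # bits <22:0>
    value = (1 - 2 * sign) * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value


sign, exponent, fraction, value = decode_s_floating(0x41260000)
print(sign, exponent, f"{fraction:023b}")   # 0 130 01001100000000000000000
print(value)                                # 10.375

# Cross-check with the library decoder.
print(struct.unpack(">f", bytes.fromhex("41260000"))[0])   # 10.375
```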


Alphanumeric Characters 


Binary numbers can encode any information, including alphanumeric characters. In the 
Alpha, the American Standard Code for Information Interchange (ASCII) is used. The 
ASCII character set includes both uppercase and lowercase alphabetic characters (A 
through Z, and a through z), the decimal digits (0 through 9), punctuation marks, and 
special control characters. The ASCII code was accepted by the American National 
Standards Institute (ANSI) to standardize the exchange of textual information between 
computers and peripherals of different manufacturers. This code exists in 7-bit and 8-bit 
forms; for simplicity we show the 7-bit chart as Table 2.3. 

Table 2.3 ASCII Character Encoding 

Hex   ASCII        Hex   ASCII        Hex   ASCII        Hex   ASCII 
Code  Character    Code  Character    Code  Character    Code  Character 

00    NUL          20    SP           40    @            60    ` 
01    SOH          21    !            41    A            61    a 
02    STX          22    "            42    B            62    b 
03    ETX          23    #            43    C            63    c 
04    EOT          24    $            44    D            64    d 
05    ENQ          25    %            45    E            65    e 
06    ACK          26    &            46    F            66    f 
07    BEL          27    '            47    G            67    g 
08    BS           28    (            48    H            68    h 
09    HT           29    )            49    I            69    i 
0A    LF           2A    *            4A    J            6A    j 
0B    VT           2B    +            4B    K            6B    k 
0C    FF           2C    ,            4C    L            6C    l 
0D    CR           2D    -            4D    M            6D    m 
0E    SO           2E    .            4E    N            6E    n 
0F    SI           2F    /            4F    O            6F    o 
10    DLE          30    0            50    P            70    p 
11    DC1          31    1            51    Q            71    q 
12    DC2          32    2            52    R            72    r 
13    DC3          33    3            53    S            73    s 
14    DC4          34    4            54    T            74    t 
15    NAK          35    5            55    U            75    u 
16    SYN          36    6            56    V            76    v 
17    ETB          37    7            57    W            77    w 
18    CAN          38    8            58    X            78    x 
19    EM           39    9            59    Y            79    y 
1A    SUB          3A    :            5A    Z            7A    z 
1B    ESC          3B    ;            5B    [            7B    { 
1C    FS           3C    <            5C    \            7C    | 
1D    GS           3D    =            5D    ]            7D    } 
1E    RS           3E    >            5E    ^            7E    ~ 
1F    US           3F    ?            5F    _            7F    DEL 


Any ASCII character-oriented peripheral, such as a line printer or a terminal, will 
output an A when the ASCII code for A (41 hexadecimal) is sent to it. Similarly, such 
devices should provide a horizontal space in response to the SP nonprinting character 
(20 hexadecimal). The ASCII encoding of the string My Alpha would be as follows: 


Address:    STRING  +1   +2   +3   +4   +5   +6   +7 
Hex code:   4D      79   20   41   6C   70   68   61 
Character:  M       y    SP   A    l    p    h    a 


Note that each character uses one byte of storage (i.e., two hex digits). The entire string 
can be referenced by the address of its first byte containing the representation of the 
character M at symbolic address STRING. 
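
The byte values for this string can be generated rather than looked up one at a time in Table 2.3. The following Python sketch does so; the variable names are chosen for this illustration.

```python
text = "My Alpha"
encoded = text.encode("ascii")     # one byte per character

print(len(encoded))                # 8 bytes of storage
print(encoded.hex(" ").upper())    # 4D 79 20 41 6C 70 68 61
print(f"{encoded[0]:02X}")         # 4D, the code for M in the first byte
```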


The arrangement of Table 2.3 makes evident the convenient feature of ASCII cod- 
ing that corresponding uppercase and lowercase letters differ by only a single bit. 
Uppercase A is 41 hexadecimal (0100 0001), and lowercase a is 61 hexadecimal (0110 
0001). This relationship simplifies case conversion or collapsing of the two cases to 
facilitate certain alphabetic sorting operations. 
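
Because the two cases differ only in the bit with weight 20 hexadecimal, case conversion reduces to a single logical operation. The Python sketch below illustrates this; the constant and function names are assumptions made here, and the functions are meaningful only for the 26 ASCII letters.

```python
CASE_BIT = 0x20   # the single bit that distinguishes 'a' (61 hex) from 'A' (41 hex)


def toggle_case(letter):
    """Flip the case of an ASCII letter by inverting the case bit."""
    return chr(ord(letter) ^ CASE_BIT)


def fold_to_upper(letter):
    """Collapse both cases of an ASCII letter to uppercase by clearing the case bit."""
    return chr(ord(letter) & ~CASE_BIT)


print(toggle_case("A"), toggle_case("z"))       # a Z
print(fold_to_upper("a"), fold_to_upper("A"))   # A A
```

Clearing the bit before comparing two letters is one way to sort alphabetically without regard to case.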


About one-fourth of the 7-bit ASCII codes designate control characters intended 
for device control. The presence of these extra codes has given the ASCII code its ver- 
satility in such areas as the control of laboratory instrumentation through relatively sim- 
ple interfaces attached to the serial communication ports of inexpensive 
microcomputers. 


Any string has two attributes: an address and a length in bytes (or number of char- 
acters). The Alpha, unlike the VAX, has no machine instructions which are intended 
specifically for manipulating strings as a special data type. Therefore the programmer 
or compiler has to take responsibility for managing strings as data structures. 


Summary 


We have introduced the basic elements of the computer from the viewpoint of the 
assembly language programmer. Important concepts include the instruction execution 
cycle, the way that addresses are assigned to information units in memory, and the 
encoding of various data types within information units. We have illustrated these for 
the Alpha architecture in particular, with an occasional backward glimpse to their coun- 
terparts for the predecessor VAX and PDP-11 architectures. 

When you have a mastery of these concepts, we can proceed in Chapter 3 to an 
explanation of the assembler that translates programs from a format readable by 
humans into the binary format actually processed by the computer. We shall introduce 
symbolic debuggers, which can interpret any binary number in many representations. In 
order for you to be secure in your understanding of such output in context, it is impor- 
tant for you to appreciate the many distinctions described throughout this chapter. 


EXERCISES 


2.1 Discuss why the three principal structures of a computer system are largely invisible 
to a high-level language programmer. 


2.2 Is the bus a part of a computer’s architecture? Why, or why not? 


2.3 What is an address? In Figure 2.2, the last address is given as N - 1. Why? If a mem- 
ory unit has an address 37 (base 10), how many memory units precede it? 


2.4 Compare and contrast a computer memory system to the human memory. 


2.5 What is the address space? How is the size of the address space determined? If a 
machine has a 4-bit address, how many addressable memory units are there? How are 
they numbered? 


2.6 If we wish to run a program with 69,326 bytes of instructions and data on a hypothet- 
ical computer, what does this imply about the size of addresses (in bits) needed on 
the computer? 


2.7 What are the parts of an instruction? Describe in detail the steps in the execution of 
an instruction. Why are instructions stored (and executed) sequentially in memory? 
Can you think of an alternative? 


2.8 You have seen the CPU execution cycle: instructions are fetched from sequential 
memory locations and executed, one after another, until a branch or change of control 
occurs. Can you imagine a machine in which instructions are not fetched sequen- 
tially? How would such a machine function? What would its instructions look like? 
Would it be “fun” to program? 


2.9 How many different memory addresses can be encoded in a 43-bit address? 


2.10 What are the information units on the Alpha? What distinguishes an information unit 
from a data type? What are the Alpha data types and what is each data type used for? 


2.11 A binary fraction is normalized if it begins with a one (the value of the fraction is 
then at least one-half). Why are normalized fractions used in floating-point number 
representations? 


2.12 Show the hexadecimal IEEE S_floating representation in memory for the decimal 
number 2.5. Show the hexadecimal IEEE S_floating representation for the decimal 
number 2.6. Which was harder to compute? Why? 


2.13 Repeat exercise 2.12 using T_floating representation. 


2.14 What is the largest odd integer that can be represented exactly in S_floating format in 
memory? 


2.15 Show the ASCII representation for this sentence. 


2.16 If the M of My Alpha is stored in a byte with address 1248 (hex), known symbolically 
as STRING, what is the address of the byte containing the a? 


2.17 (OpenVMS programming environment only.) Revisit exercise 1.9. Explain how to 
use the /byte qualifier of the examine command in order to verify that the full 
quadword results calculated by the program are correct. 


2.18 Extend the SQUARES program in Chapter 1 in order to compute a list of values of 
one of the following polynomials for integer values of N from 1 through 5, again 
without using explicit multiplication instructions. 


a. N^2 + N 
b. 2N^2 - 1 
c. N^2 - N + 2 


The Alpha instruction subq Rx, Ry, Rz subtracts the contents of Ry from the con- 
tents of register Rx and places the 64-bit difference in Rz. Change the saved_regs 
specification to reflect the number of registers that you need for intermediate results. 
Store only the function values in memory locations. (This is only a pencil-and-paper 
exercise unless you have access to the MACRO-64 assembler in the OpenVMS 
programming environment; if you do have such access and wish to execute your 
program, use the Control-Y technique to halt your program and then use the examine 
command to print out the computed results.) 





CHAPTER 3 


The Program 
Assembler and 
Debugger 


In writing this book, we have assumed that you are 
already familiar with programming in one or more standard high-level languages such 
as C, FORTRAN, and Pascal. We expect that you will continually perceive similarities 
and contrasts between these levels of languages throughout our explanations in this 
book about Alpha assembly language in particular, and that you will thereby become 
better prepared to understand other computer architectures and their intrinsic perfor- 
mance characteristics. 


A system program called a compiler is required to translate a high-level language 
program (the source program) into a machine language program. Similarly, a system 
program called an assembler converts an assembly language program into a machine 
language program. Usually the machine language program produced by the compiler or 
assembler is first produced in a version called an object file. Another system program 
called a linker combines one or more of these object files, depending on the complexity 
and modular construction of the overall program, into a final binary form called an exe- 
cutable image. This is a disk file containing machine instructions that can be loaded 
directly into memory by the operating system and executed by the central processing 
hardware in the stepwise fashion already described in Chapter 2. 


High-level languages assist you with declaring constants and variables and with 
allocating appropriate memory storage for them. You have to be much more conscious 
of storage allocation when you work in assembly language, but you do not have to 
assign a specific numeric address for everything. The assembler program permits you to 
refer to storage locations and to specific milestone points within the program using 
symbolic names. Then if you need to insert something during a subsequent program 
revision, the assembler bears the brunt of the workload of adjusting all of those 
addresses for you. 

High-level languages have major advantages over an assembly language. These 
include algebraic or English-like style as well as constructs that make expressing an 
algorithm relatively easy. Moreover, the core capabilities of the major languages have 
been defined by standards-setting groups in order to assure a good degree of machine 
and vendor independence leading to the concept of portability of an initial program- 
ming effort. When a given vendor introduces extensions of a standard language, those 
are accompanied by warnings about lack of portability. In contrast, the close tie-in 
between any vendor’s assembly language and the corresponding specific computer 
architecture represents the most serious drawback to the selection of assembly language 
for development of software applications where marketing for different architectures is 
thought to be viable. Yet that close tie-in explains why we feel that some direct experi- 
ence with assembly language leads first to a good understanding of one particular archi- 
tecture and then to a deeper appreciation of architectural principles. 


Program Development Steps 


Few computer programs are successfully produced in final form the first time. Large 
real-world software development projects also involve the endeavors of teams of design- 
ers, writers, and testers. Even when only one person’s efforts are involved, a measure of 
discipline in the development process almost always leads more quickly to a better result 
than a haphazard approach. Figure 3.1 schematically outlines the stepwise process of 
developing a software component using the typical tools of a programming environment, 
which may include a text editor, an assembler (or compiler for a higher level language), a 
linker, and a loader (or system command to run a program). Not all of the features in Fig- 
ure 3.1 are provided in every programming environment. In fact, that is a major reason 
why we present illustrations from different Alpha-based programming environments in 
this chapter and throughout this book, selecting one to illustrate a point there and then 
another to illustrate a different point here. 





[Figure 3.1 is a flow diagram. Its recoverable labels are: Proposer; Problem statement 
(with constraints); Module listing file; Module object file; Other object files; 
Command line modifiers; System libraries; Program map file; Program image file; 
Run-time input; Evaluator; and feedback paths marked “fix logic errors” and 
“fix algorithmic errors.”] 

Figure 3.1 Stepwise development using programming environment tools. 

Figure 3.1 also shows several roles for people throughout this overall process; 
these may include a proposer, an author, and an evaluator. If you are a member of a pro- 
fessional development group, you may concentrate on one or a few such roles. On the 
other hand, if you are a student, you will probably be expected to take on each of those 
roles as part of your learning experience. Numerous feedback loops in this diagram 
emphasize the importance of comprehending and correcting various errors or mismatch 
situations that can impede progress toward a finished product. Almost invariably it is 
most efficient to intervene with corrections at the first point where anomalies arise. 
Moreover, if you see a discouraging flood of error messages, it is a good strategy to 
understand and attempt to fix only the first few of them in one iteration. The flood may 
then subside, permitting you to concentrate on another trouble spot in the next iteration. 

With Figure 3.1 as a guide, we may outline the major steps to be accomplished 
and the system tools to be used in assembly language programming, whether for writing 
a simple routine or producing a complex application, as follows: 


1. State the problem in a way that expresses a clear understanding of the known con- 
straints. 

2. Design an algorithm. Analyze the problem, starting from previous experience or 
existing exemplary models. Develop ideas that have the best chance of leading to 
an efficient algorithm. 

3. Produce or modify the source program conventionally named PROG.s (Unix) or 
PROG.m64 (OpenVMS). Editing can be done using standard editors for the par- 
ticular programming environment, such as vi and emacs (Unix) or EDT and TPU 
(OpenVMS). Alternatively, the source program can be developed on some other 
system such as a Macintosh® or Windows system, saved as a plain text file, and 
moved over a network connection to the Alpha system using FTP. Indeed, the pro- 
grams in this book were edited using BBEdit™ on a Macintosh PowerBook® 
computer and transferred to Alpha systems using the Fetch client FTP program. 
We could just as well have used WordPad and FTP on a Windows 95 personal 
computer. 

4. Produce a runnable program. With OpenVMS, this involves two separate steps, 
first using the assembler called MACRO-64 and then the linker; certain file types 
are assumed for the source program (.m64), the intermediate object file (.obj), 
and the final executable file (.exe). With Unix, this can be done with a single 
step using the same “cc” command as for compiling C language programs, 
though the system software does actually use an assembler and a linker; thus the 
source file must be given a name ending in “.s” as a signal to the cc command 
that assembly language is involved. Sample command lines are as follows: 
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$ macro/alpha/debug/list PROG (OpenVMS) 
S link/debug/map PROG (OpenVMS) 
> cc -g -00 -o PROG PROG.s (Unix) 
> nm PROG (Unix) 


where the flag -g or the qualifier /debug optionally requests the assembler
and linker to include extra information in the output files that the
symbolic debugger requires, where the qualifier /list requests a listing
file (PROG.lis), where the qualifier /map or the nm command requests
a map file (PROG.map) or information about symbol values,
where the flag -O0 inhibits any optimization or rearrangement of the
flow of your program, and where the flag -o introduces a name for the
output executable file.


5. Run the program, possibly with the assistance of the debugger. 
> PROG (Unix) 
$ run PROG (OpenVMS)


Specific commands for running a program under control of a symbolic de- 
bugger are taken up later in this chapter. 
6. Compare actual output with expected output. 
7. Return to step 4 or 3 (trivial errors), step 2 (serious errors), or step 1 (if you are 
having an especially bad day). 


When the program is finally all right, you can (and should) produce a final version 
without inclusion of the debugging aids. 

As depicted in Figure 3.1, additional implicit inputs from vendor-supplied system 
libraries are involved both in the compilation or assembly process and in the linkage 
process. Those library inputs facilitate the use of such capabilities as well-standardized 
mathematical functions and support routines provided by the operating system. In large 
development projects, additional libraries of previously produced or purchased routines 
may be available; use of them may be obligatory according to the overall project design, 
methodology, and management. 

We should also take up the topic of case sensitivity here. By convention, many 
things in the Unix programming environment are case-sensitive, with a preponderance 
of lower case usage over upper case usage. Great care must therefore be taken with 
lower and upper case. Note, for example, the occurrence of both lower case letter “oh” 
(-o) and upper case “oh” with the numeral “zero” (-O0) in a command in step 4 above. 
In contrast, the OpenVMS programming environment accepts either lower case or 
upper case and construes them as equivalent in a great many circumstances.

We generally use lower case in this book both for ease of readability and for
emphasis of similarities. If you read standard printed manuals describing the OpenVMS 
environment, however, you may often see upper case in situations where we have used 
lower case instead. 


SQUARES2: A Variant Example 


Later in this chapter we will discuss rerunning a variant of the SQUARES program 
originally introduced in Chapter 1 (Figure 1.2) under control of a symbolic debugger, 
which can inspect the contents of memory locations and registers. Let us assume that 
we have used some text editor to change SQUARES.M64 into squares2.s for Unix
and squares2.m64 for OpenVMS, shown in Figure 3.2.


/* SQUARES2  Table of Squares (Unix) */
        .data                   # Section for data
        .comm   sq1,8           # To store 1 squared
        .comm   sq2,8           # To store 2 squared
        .comm   sq3,8           # To store 3 squared
                                # etc.
        .text                   # Section for program code
        .align  4               # Quadword alignment
        .set    noreorder       # Disallow rearrangements
        .globl  main            # These three lines
        .ent    main            #   mark the mandatory
main:                           #   'main' program entry
        ldgp    $gp,0($27)      # Load the global pointer
        .frame  $sp,0,$26,0     # Describe the stack frame
        .prologue 1             # Say that $gp is in use
        .globl  first
first:  mov     1,$1            # R1 = first difference
        mov     2,$2            # R2 = second difference
        mov     1,$0            # R0 = first square
        stq     $0,sq1          #   to be stored
        addq    $2,$1,$1        # Adjust first difference
        addq    $1,$0,$0        # R0 = second square
        stq     $0,sq2          #   to be stored
        addq    $2,$1,$1        # Adjust first difference
        addq    $1,$0,$0        # R0 = third square
        stq     $0,sq3          #   to be stored
                                # etc.
        .globl  done
done:   mov     $31,$0          # Signal all is normal
        ret     $31,($26),1     # Back to Unix environment
        .end    main            # Mark end of procedure

        .title  SQUARES2 Table of Squares (OpenVMS)
        $routine squares2, data_section_pointer=true, -
                kind=stack, saved_regs=<r2,r15>
        $data_section
sq1::   .blkq   1               ; To store 1 squared
sq2::   .blkq   1               ; To store 2 squared
sq3::   .blkq   1               ; To store 3 squared
                                ; etc.
        $code_section
        .base   r27,$ls         ; R27 -> linkage section
        ldq     r15,$dp         ; R15 -> data section
        .base   r15,$ds         ; Tell MACRO to use this
first:: mov     1,r1            ; R1 = first difference
        mov     2,r2            ; R2 = second difference
        mov     1,r0            ; R0 = first square
        stq     r0,sq1          ;   to be stored
        addq    r2,r1,r1        ; Adjust first difference
        addq    r1,r0,r0        ; R0 = second square
        stq     r0,sq2          ;   to be stored
        addq    r2,r1,r1        ; Adjust first difference
        addq    r1,r0,r0        ; R0 = third square
        stq     r0,sq3          ;   to be stored
                                ; etc.
done::  mov     1,r0            ; Signal all is normal
        $return                 ; Return to OpenVMS
        $end_routine squares2   ; Needed by $routine
        .end    squares2        ; Set start address


Figure 3.2 SQUARES2 program for use with symbolic debugger 


Note that the Alpha assembler for the Unix programming environment names the 
processor registers differently from the hardware manuals. Integer registers R0 ... R31
are denoted by $0 ... $31 and the floating-point registers F0 ... F31 by $f0 ... $f31.

You may have sensed that the intentional infinite loop in SQUARES would not 
represent a model for larger programming efforts. The reason for marking one program 
line with first and another with done in the variant SQUARES2 will become evi- 
dent when we discuss using a symbolic debugger later in this chapter. SQUARES2, 
rather than SQUARES, will thus serve as our current example for this chapter. 

There are certain aspects of SQUARES2 that we have to gloss over in the first
instance in this chapter, such as the reasons for the first and last few lines in each vari- 
ant. We generally endeavor not to oversimplify, however, and such details will be 
explained more fully in later chapters. 
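
The arithmetic in SQUARES2 can be checked in a high-level language before we look at the assembler itself. The following is only a sketch in Python, not part of the original program; the function name squares_by_differences is our own invention, and the three variables merely stand in for registers R0, R1, and R2.

```python
def squares_by_differences(count):
    """Return the first `count` squares using additions only, as SQUARES2 does."""
    first_diff = 1      # stands in for R1 (first difference)
    second_diff = 2     # stands in for R2 (second difference, never changes)
    square = 1          # stands in for R0 (current square)
    table = []
    for _ in range(count):
        table.append(square)        # corresponds to stq: store the square
        first_diff += second_diff   # corresponds to addq: adjust first difference
        square += first_diff        # corresponds to addq: form the next square
    return table


print(squares_by_differences(3))    # [1, 4, 9]
```

Each pass through the loop mirrors one group of stq, addq, addq statements in Figure 3.2; the final pass computes one unused value, which the straight-line assembly version simply omits.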
Assembler Statement Types


Learning a new language—whether a human language or a computer language— 
involves at the outset quite a large amount of new terminology just to get started. When 
we first introduced the program SQUARES in Chapter 1 (Figure 1.2), we left many 
details unexplained until later. Although we will now proceed somewhat more system- 
atically, you will still need to “go along with the flow” when we have to introduce sev- 
eral new concepts all at once and then return to discuss them individually at a later time. 


We can think of each line in an assembly language program as being a statement 
that may be imperative, declarative, or of a controlling nature. 


• Imperative statements represent machine instructions in symbolic form (for example,
addq for an addition instruction or mov for a data transfer instruction in Figure 3.2).

• Declarative statements control allocation of storage or perform various naming
functions. That is, such statements are not actual machine instructions; rather, they
reserve space, define symbols, or assign particular initial contents to memory
locations. For example, .title names the routine and .blkq reserves a block
of quadwords in Figure 3.2.

• Control statements allow the programmer to have some control over certain portions
of the assembly process. For example, $routine in Figure 3.2 produces a
proper pattern of instructions to begin the program, adapting a predefined general
pattern using programmer-specified details such as which registers to save.


Declarative statements, which have names beginning with a dot, are only directives to 
the assembler program; they do not generate machine instructions that can be executed 
at run time. Control statements may indirectly produce machine instructions (such as 
those to save or restore register contents). 


Statement Format 


Unlike high-level languages where the syntax for various statement types may differ, an 
assembly language program uses the same general format for every line with four ele- 
ments in a standard order: label, operator, specifiers, comment. The style of assembly 
language statements differs from one computer manufacturer to another—indeed from 
one programming environment to another. For example, the punctuation character to 
separate in-line comments varies, just as it does from one high-level language to 
another. For the Alpha, the assembly language statement format is as follows:

Label:  Operator  Specifier1, ...   # Comment     (Unix)
Label:  Operator  Specifier1, ...   ; Comment     (OpenVMS)

where


• Label is a symbolic address in the form of a character string terminated by a
colon (optionally by two colons for OpenVMS).

• Operator names a specific statement, which may be one of the three types just
described. This is a single word; a space or tab character marks the end of this
field.

• Specifier is a symbolic name, value, or expression that completes the statement.
For a machine instruction, a specifier is an operand such as the name of a
processor register or a symbolic memory address. For a directive, it is some kind
of argument or parameter. Multiple specifiers are separated by commas (spaces
are generally ignored). The last specifier is delimited by the number sign or
semicolon in front of a comment, or by the end of the line.

• Comment is a human-language description of the function of the whole line. If
present, a comment must begin with a number sign (Unix) or semicolon (OpenVMS);
it ends at the end of the line. For Unix, the C language convention for
multi-line comments is also supported (/* ... */).


The label and comment fields are optional. The number of specifiers depends on the 
statement type. For OpenVMS only, a statement may be continued onto another physi- 
cal line by putting a hyphen before the comment field (see $routine statement in 
Figure 3.2). With these few restrictions, the statement format is free-form and not 
bound to specific columns. Spaces and tabs are generally interchangeable. For best leg- 
ibility, however, we encourage keeping the fields lined up neatly. 
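
To make the four-field format concrete, here is a minimal sketch in Python of how a line might be taken apart. It is an illustration only, not how either assembler is actually implemented: the function name split_statement is hypothetical, and the sketch deliberately ignores string literals, continuation lines, /* ... */ comments, and commas nested inside angle brackets.

```python
import re

# A label is an identifier followed by one colon (or two, for OpenVMS).
LABEL = re.compile(r"^([A-Za-z_.$][A-Za-z0-9_.$]*)::?\s*(.*)$")


def split_statement(line, comment_char="#"):
    """Split one source line into (label, operator, specifiers, comment)."""
    code, found, comment = line.partition(comment_char)
    comment = comment.strip() if found else None
    code = code.strip()

    label = None
    match = LABEL.match(code)
    if match:
        label, code = match.group(1), match.group(2)

    parts = code.split(None, 1)          # operator ends at first space or tab
    operator = parts[0] if parts else None
    rest = parts[1] if len(parts) > 1 else ""
    specifiers = [s.strip() for s in rest.split(",")] if rest else []
    return label, operator, specifiers, comment


print(split_statement("first:  mov  1,$1   # R1 = first difference"))
print(split_statement("done::  mov  1,r0   ; Signal all is normal", ";"))
```

The second argument selects the comment character, so the same sketch serves for a Unix line (#) or an OpenVMS line (;).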

In this book we will introduce only some of the capabilities of the OpenVMS 
Alpha assembler (MACRO-64) and the counterpart Unix assembly language for the 
Alpha. Fuller details are given in the vendor manuals cited in the references at the end 
of this chapter. In the main, we have selected aspects that are common across program- 
ming environments and that have analogues for working at the assembly language level 
for other processor architectures. 


Symbolic Addresses 


One almost always assigns symbolic names to data locations like sq3 in SQUARES2, 
rather than assign numeric addresses manually. The assembler and linker keep track of 
the numeric values corresponding to such symbolic names for us. We can also assign
symbolic names to particular statements in an assembly language program. Those
names can then be referenced in other statements such as branches or procedure calls 
that direct the logical flow of the program. Again, since the assembler and linker keep 
track of the actual numeric addresses, we do not have to do so ourselves. 

In Alpha assembly language, the notation label: or label:: at the beginning
of a line creates a symbol and equates it to the address of that statement. In this way, the 
assembler helps us find particular data or instructions because we can use such symbols 
in the specifier fields of other statements. 

We should perhaps emphasize that the labels which we assign to data locations, 
such as sq3 in SQUARES2, represent the addresses of the storage locations, not the 
values stored there. That is, an stq instruction really means “store the specified
quadword value into the specified quadword location in memory.”


Classes of Alpha Assembly Language Operators 


Our sample program illustrates several of the particular items that can appear in the 
operator field of a line in an Alpha assembly language program. We organize these into 
various classes in Table 3.1, such as opcodes, pseudo-operations, assembler directives, 
system-defined macros, and user-defined macros. We also include a few other operators 
beyond those in SQUARES and SQUARES2 in order to prepare you for some of the 
exercises at the end of this chapter. 


Table 3.1 Classes of Operators in SQUARES and SQUARES2

Class                          Operator        Purpose

Alpha opcodes                  addq            Quadword addition
                               addl            Longword addition
                               subq            Quadword subtraction
                               subl            Longword subtraction
                               ldq*            Quadword from memory to integer register
                               ldl*            Longword from memory to integer register
                               stq*            Quadword from integer register to memory
                               stl*            Longword from integer register to memory
                               br              Unconditional branch
                               ret             Return to caller

Alpha pseudo-operations        ldgp            Load Unix global pointer for access to data
                               mov             Put integer constant into register

Unix assembler directives      .data           Switch to memory region for data
                               .comm           Allocate named storage
                               .text           Switch to memory region for instructions
                               .align          Round up to specified address granularity
                               .set            Specify assembler behavior
                               .globl          Make a symbol globally accessible
                               .ent            Mark entry of a procedure
                               .frame          Describe the kind of routine
                               .prologue       Mark end of prologue section of procedure
                               .end            Mark end of procedure introduced by .ent

OpenVMS assembler directives   .title          Define module name and listing heading
                               .blkq           Allocate quadword storage
                               .blkl           Allocate longword storage
                               .base           Declare that a register holds a base address
                               .end            Mark last physical line of source file and set
                                               a symbolic starting address

OpenVMS system-defined macros  $routine        Set up memory regions, save registers, define
                                               the kind of routine, etc.
                               $code_section   Switch to memory region for instructions
                               $data_section   Switch to memory region for data
                               $return         Restore registers saved by $routine; exit
                                               to the calling routine or operating system
                               $end_routine    Close up data structures for a routine

OpenVMS user-defined macros    (none in SQUARES2; see Chapter 9)

*The Unix assembler may produce two-instruction sequences for load and store operations; the OpenVMS
assembler always treats these strictly as primitive machine instructions.


Names for opcodes like ldq and subl are usually chosen as mnemonic references
to the various fundamental instructions supported by an architecture, here the Alpha
architecture. The assemblers for the Alpha also predefine certain pseudo-operation codes
like mov, which commonly are fundamental operations in other architectures (refer to
Figure 1.2), but which can be accomplished as special cases of certain Alpha instructions 
and thus need not be redundantly implemented at the hardware level. Directives to 
assemblers from Digital Equipment Corporation have names that conventionally begin 
with a dot (again, refer to Figure 1.2). Similarly, system-defined macros (OpenVMS) 
have names that conventionally begin with a dollar sign. 


The Functions of a Symbolic Assembler 


An assembler translates a symbolic assembly language program line by
line into a corresponding binary machine language program. During translation, the
assembler performs the following functions:


1. It builds an internal symbol table that contains the values of all user-defined labels 
and other symbols. 

2. It maintains location counters that determine where the next instruction or data 
item will be placed in memory. 

3. It translates the symbolic instruction opcodes and operand specifiers into binary 
machine code, called the object file. 

4. It may produce a listing file for the programmer, showing the instructions and data 
and how these were translated and assigned to unique memory locations. 


In order to explain how an assembler works, we look next at the elements and mecha- 
nisms it utilizes during the translation process. These features include the specification 
of constants, the definition and use of symbols, the management of storage allocation, 
the location counters, the evaluation of expressions, control statements, and sometimes 
a listing file. 


Constants 


Alpha assemblers interpret all constants appearing in the specifier field as non-negative 
integers by default. Negative numbers can be produced using the unary minus operator 
(described later). Integer constants are not followed by a decimal point. (The specification 
of floating-point constants, where a decimal point may occur, is taken up in Chapter 9.) 
Table 3.2 summarizes how to control the assembler’s interpretation of the radix 
for a constant. The Unix assembler follows the same convention as compilers for the C 
language; that is, an octal number begins with a zero, a decimal number begins with a 
non-zero digit, and a hexadecimal number begins with the prefix 0x. The OpenVMS
assembler interprets numbers as decimal by default, or according to a unary prefix
(Table 3.2). Here are some examples for the value twenty-nine:

035         29          0x1d                    (Unix)
^b11101     ^o35        ^d29        ^x1d        (OpenVMS)


Somewhat uncharacteristically, case does not matter in the Unix environment when 
specifying the digits of hexadecimal numbers (0x or 0X; a-f or A-F) to the assembler. 


Table 3.2 Control of Radix for Specifying Constants with the Alpha Assembler

Radix          Valid characters       Unix                  OpenVMS
Binary         0 and 1                no special method     ^b...
Octal          0 to 7                 0...                  ^o...
Decimal        0 to 9                 non-0...              ... or ^d...
Hexadecimal    0 to 9 and a to f      0x...                 ^x...
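
The radix rules can be summarized in a few lines of Python. This is a sketch under stated assumptions rather than a description of either assembler's internals: the function names are hypothetical, and we assume that the OpenVMS unary radix prefixes are ^b, ^o, ^d, and ^x, accepted in either case.

```python
def parse_unix_constant(text):
    """C-style rule: 0x means hexadecimal, a leading 0 means octal, else decimal."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if lowered.startswith("0") and len(lowered) > 1:
        return int(lowered[1:], 8)
    return int(lowered, 10)


# Assumed OpenVMS unary radix prefixes and the bases they select.
VMS_RADIX = {"^b": 2, "^o": 8, "^d": 10, "^x": 16}


def parse_vms_constant(text):
    """Decimal by default, otherwise the base named by the unary prefix."""
    lowered = text.lower()
    if lowered[:2] in VMS_RADIX:
        return int(lowered[2:], VMS_RADIX[lowered[:2]])
    return int(lowered, 10)


for literal in ("035", "29", "0x1d"):
    print(literal, parse_unix_constant(literal))
for literal in ("^b11101", "^o35", "^d29", "^x1d"):
    print(literal, parse_vms_constant(literal))
```

Every literal in the two loops evaluates to twenty-nine, matching the examples given for the two environments.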





Several mov statements in SQUARES2 (Figure 3.2) contain constants. Sometimes 
it is convenient to use a symbolic representation instead of an actual numeric represen- 
tation. For example, the direct assignment statements 


SIXTEEN_ONES = 0xffff       (Unix)
SIXTEEN_ONES = ^xffff       (OpenVMS)


define (or redefine) a symbol to have the hexadecimal value ffff. Actually, in most
contexts the symbol SIXTEEN_ONES will behave as the zero-extended 64-bit
hexadecimal value 000000000000ffff.

The OpenVMS assembler also provides a way to define an ASCII constant by using 
the unary operator ^a with matching delimiters. For example, all of the statements 


ABC = ^a/abcd/
ABC = ^a"abcd"
ABC = ^a@abcd@

equate the symbol ABC to the ASCII equivalent of the four characters abcd. Recall 
that the codes for a, b, c, d will be located in the byte order 0, 1, 2, 3 when stored as 
a longword or quadword using the little-endian convention adopted by Digital Equipment
Corporation. (The Unix assembler only accepts single-character definitions,
e.g., ABC="a".)
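
The byte ordering just described can be verified with a short Python sketch. The function name ascii_constant is our own; the code only models the little-endian packing of character codes and makes no claim about how the assembler stores symbol values internally.

```python
def ascii_constant(text):
    """Pack ASCII characters little-endian: first character in byte 0."""
    return int.from_bytes(text.encode("ascii"), byteorder="little")


value = ascii_constant("abcd")
print(hex(value))                     # 0x64636261

# Stored as a quadword, bytes 0..3 hold a, b, c, d and bytes 4..7 hold zero.
stored = value.to_bytes(8, byteorder="little")
print(stored)
```

Because byte 0 is the least significant byte, the code for a (hexadecimal 61) appears at the right-hand end when the value is written as a number.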
Symbols or Identifiers


A symbol is a string of up to 31 characters. The Unix documentation also uses the word 
identifier as a synonym for symbol. Symbols may include letters a-z and A-Z (case is 
significant for Unix but ignored for OpenVMS), numerals 0-9, the period (.), the dollar
sign ($), and the underscore (_). Except for one special type of symbol used only for
branch control, the first character of a symbol cannot be a numeral. In addition, symbols 
are restricted from conflicting not only with any register name but also with certain 
built-in directives beginning with a period. Digital Equipment Corporation uses dollar 
signs for many system-supplied items (OpenVMS); therefore, you can reduce the 
chance of redefining one of those by avoiding the dollar sign in your own definitions. 
Finally, periods should be avoided in symbols that are global because some languages 
like FORTRAN cannot tolerate them. 

As we have said, symbols appearing in the label field of a statement are defined by 
the assembler to represent the address of that line. 

Constants are a second type of symbol. These assignments are allowed for con- 
venience and documentation only. Declaration of symbolic constants near the head of 
a routine has a parallel in the compile-time symbolic declarations supported by some 
high-level languages. Such declarations do not generate machine instructions. The 
assembler or compiler simply substitutes the equivalent numeric or string value 
whenever the symbol occurs in any lines below the point of definition. 


Symbolic addresses defined as labels with a single colon (:) and symbolic constants
defined by direct assignment using a single equal sign (=) are local to the program
module where they are defined. For OpenVMS only, when a double colon (::) or
double equal sign (==) is used, the symbol can be referenced globally, i.e., from all
program modules that are combined by the linker into a complete program. For Unix, the
.globl directive must be used instead.


Storage Allocation 


Assemblers provide several types of directives or declarative statements to facilitate the 
storage of data and parameters for a program module. Some of these for the Alpha 
assemblers are listed in Table 3.3. For numeric data, there are groups of directives that 
reserve memory cells and store particular values into those cells, and there are other 
directives that just reserve storage but do not initialize it in any particular way. The 
former are appropriate where the data are “givens” for the program to work with; the 
latter are appropriate to allocate space for scratch usage or for computed results. 


Table 3.3 Assembler Directives for Storage Allocation

Directive                            Meaning

label: .byte value_list              Store specified values in successive bytes in memory
                                     from symbolic address label onwards.
label: .word value_list              Store specified values in successive words in memory.
label: .long value_list              Store specified values in successive longwords in
                                     memory.
label: .quad value_list              Store specified values in successive quadwords in
                                     memory.
label: .s_floating value_list        Store specified values in successive memory units.
label: .t_floating value_list        Store specified values in successive memory units.
label: .ascii "string"               Store ASCII representation of string in successive
                                     bytes in memory. (OpenVMS permits any matched pair
                                     of delimiter characters besides quotation marks.)

Unix-specific directives             Meaning

label: .asciiz "string"              Store the ASCII representation of string followed
                                     by a zero byte (the ASCII NUL character).
       .comm name, size              Reserve number of bytes specified by the symbol
                                     size, with name corresponding to the first address.

OpenVMS-specific directives          Meaning

label: .asciz "string"               Store the ASCII representation of string followed
                                     by a zero byte (the ASCII NUL character).
label: .address address_list         Store the 64-bit address of the specified locations in
                                     successive quadwords in memory.
label: .blkb count                   Reserve a contiguous block of count bytes of
                                     memory. The program should not assume any
                                     particular initial values there.
label: .blkw count                   Reserve a contiguous block of count words of
                                     memory.
label: .blkl count                   Reserve a contiguous block of count longwords of
                                     memory.
label: .blkq count                   Reserve a contiguous block of count quadwords of
                                     memory.


When we want to allocate memory units and initialize them with particular values, 
we use statements of the form: 


list:   .quad   123,987,42
last:   .quad   value

The result in memory will look like this:

        123     : list
        987     : list+8
        42      : list+16
        value   : last





On the other hand, if we only want to reserve memory units for a four-element data
structure, but not to give them initial values, we could more simply use the Unix
construct

        .comm   list, 3*8
        .comm   last, 8

or the OpenVMS construct

list:   .blkq   3
last:   .blkq   1

Again here, the first memory unit has address list, the second list+8, etc.

Value lists for the floating-point directives .s_floating and .t_floating 
support a standard form of scientific notation, e.g., 6.02E+23. Be sure to specify 
enough significant digits for quantities like pi or fundamental physical constants like 
the speed of light. 

In all instances, you should appreciate that the assembler is not a “typed” lan- 
guage. The storage directives merely allocate memory units and optionally associate a 
symbolic label with the lowest-order byte. Subsequently, you can freely access those 
memory units in other ways. For example, a string of 8 ASCII spaces can be viewed as
the quadword hexadecimal integer 2020202020202020. There is no run-time informa- 
tion to tell the hardware implementation how this information was defined, or how it 
should be treated, except to execute whatever instructions the programmer specifies. 
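
A short Python sketch shows the same eight bytes viewed in different ways. It uses the standard struct module purely as an illustration of untyped storage; nothing here models the assembler itself.

```python
import struct

spaces = b" " * 8                       # eight ASCII space characters (hex 20)

# The same bytes read as one little-endian quadword ...
(as_quadword,) = struct.unpack("<Q", spaces)
print(hex(as_quadword))                 # 0x2020202020202020

# ... or as two little-endian longwords.
low, high = struct.unpack("<LL", spaces)
print(hex(low), hex(high))              # 0x20202020 0x20202020
```

The bytes carry no record of having been defined as characters; the interpretation comes entirely from the operation applied to them.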


The Location Counter 


Programmers normally write blocks of statements where the implicit flow of control 
is intended to be sequential. Thus most types of instructions do not need to contain an 
address field for specifying which instruction should be executed next. If a computer 
is built on this assumption that instructions are to be executed in sequence, the
instructions must be stored that way in memory. When the program counter is incremented
automatically as one instruction is being executed (Figure 2.4), it then necessarily
points to the next instruction in memory to be executed after the current one is
completed. 

The assembler maintains a location counter, analogous to the program counter of 
the CPU, which keeps track of where to store the next instruction as the executable pro- 
gram is being constructed, line by line. Alpha assemblers employ additional location 
counters to construct the data regions for a program. The reason for multiple location 
counters is that sophisticated multi-user operating systems like Unix or OpenVMS can 
manage physical memory in such a way as to assure that program code is read-only 
(i.e., cannot corrupt itself), while data regions can be set up to be either read-only or 
read-write as appropriate to their intended purposes. 

The assembly process usually begins by initializing every location counter with 
the value zero. Since there cannot be more than one memory cell with an actual address 
of zero, it falls to the linker utility program to arrange the executable code region and all 
the data regions into some suitable overall order and, in effect, to add constants to most 
or all of those zeros. Thus the location counter values really specify address offsets rela- 
tive to the actual or physical origin of each region. 

As the assembler reads instructions or data specifications from the source pro- 
gram, it translates and outputs the equivalent binary patterns to the object file. The loca- 
tion counter is incremented appropriately. For every Alpha instruction, the location 
counter for executable code advances by 4 because all instructions are longwords. With 
the OpenVMS assembler, the .blkq directive defines the label (if present) on the 
source line as the current location counter value, and then advances that data location 
counter by 8 because a quadword is 8 bytes, or 8 address values, in length. That is how 
sq2 in SQUARES or SQUARES2 acquires an address (before and after the linking 
process) which is sq1+8. 
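
The bookkeeping can be sketched in Python with one counter per region. This is a simplified model under several assumptions: the function name assign_offsets and the tuple layout are hypothetical, code offsets are counted from the label first (ignoring whatever instructions precede it), the lines marked etc. are left out, and each stq is assumed to be a single instruction as with the OpenVMS assembler.

```python
INSTRUCTION_BYTES = 4    # every Alpha instruction is one longword
QUADWORD_BYTES = 8


def assign_offsets(statements):
    """Return {label: (region, offset)} using a separate counter per region."""
    counters = {"data": 0, "code": 0}
    symbols = {}
    for region, label, operator, count in statements:
        if label is not None:
            symbols[label] = (region, counters[region])   # define label first
        if operator == ".blkq":
            counters[region] += QUADWORD_BYTES * count    # then advance
        else:
            counters[region] += INSTRUCTION_BYTES * count
    return symbols


program = [
    ("data", "sq1", ".blkq", 1),
    ("data", "sq2", ".blkq", 1),
    ("data", "sq3", ".blkq", 1),
    ("code", "first", "instructions", 10),   # three mov, three stq, four addq
    ("code", "done", "instructions", 1),
]

symbols = assign_offsets(program)
print(symbols["sq2"])     # ('data', 8)
print(symbols["done"])    # ('code', 40)
```

Both counters start at zero, which is why the resulting values are only offsets; the linker supplies the origin of each region later.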

In the MACRO-64 assembler for OpenVMS, the period (.) symbolizes the current
value of the location counter. Suppose we want to establish a 10-element data
structure whose first element is self-referential, i.e., the value stored at location block 
is the address corresponding to the symbol block. We can accomplish this either with 


block:  .address block          ; refers to location ‘block’
        .blkq   9

or with

block:  .quad   .               ; . refers to location ‘block’
        .blkq   9

Of these, the use of .address is preferable to .quad because it clarifies the true
intent of the directive and because it allows the assembler to pass additional information 
to the linker. Note, moreover, the inequivalence of 


block:  .address block          ; refers to location ‘block’
        .address block          ; still refers to location ‘block’

and

block:  .quad   .               ; refers to location ‘block’
        .quad   .               ; refers to location ‘block+8’


because the location counter has advanced by 8 units in providing for the first quadword 
at symbolic address block. 

The Unix assembler does not permit the location counter (.) to be used as an 
implicitly known symbolic value in these particular ways. 


Expressions 


Some of the power of a modern assembler program stems from its support of expressions.
An expression is a combination of terms joined by binary operators. The most
useful binary operators used by the Alpha assemblers are defined in Table 3.4. 


Table 3.4 Assembler Arithmetic and Logical Binary Operators

Binary
Operator    Example    Meaning
+           A+B        Integer addition of B to A
-           A-B        Integer subtraction of B from A
*           A*B        Integer multiplication of A by B
/           A/B        Integer division of A by B
&           A&B        Logical AND of A and B
|           A|B        Logical OR of A and B (Unix)
!           A!B        Logical OR of A and B (OpenVMS)
^           A^B        Logical EXCLUSIVE OR of A and B (Unix)
\           A\B        Logical EXCLUSIVE OR of A and B (OpenVMS)


A term can be a number, a symbol that has been given a numeric value, or a built-up 
expression. Any term may be preceded by a unary operator, either a radix control operator 
from Table 3.2 or one of the arithmetic and logical unary operators in Table 3.5.

Table 3.5 Assembler Arithmetic and Logical Unary Operators

Unary
Operator    Example    Meaning
+           +A         Results in the positive value of A
-           -A         Results in the negation (two's complement) of A
~           ~37        Results in the binary complement (one's complement) of
                       37 (Unix)
^c          ^c37       Results in the binary complement (one's complement) of
                       37 (OpenVMS)


The Unix assembler considers the unary operators (Table 3.5) to have highest pre- 
cedence, the binary plus and minus operators to have lowest precedence, and all other 
binary operators to have intermediate precedence. These precedence rules are different 
from those of the C language. If you do not want A+B*C to be interpreted by default as
A+(B*C), for example, you can explicitly specify (A+B)*C instead. Parentheses can
be nested to clarify the order of evaluation for expressions that are more complicated,
such as (A+(B-C)*(D+E))/F.

Somewhat differently, the OpenVMS assembler processes expressions from left to 
right, with no operator precedence. The order of evaluation can be changed (or made 
clearer to the reader) by using angle brackets (< >). If you do not want A+B*C to be 
interpreted by default as <A+B>*C, for example, you can explicitly specify A+<B*C> 
instead. Angle brackets can be nested to clarify the order of evaluation for expressions 
that are more complicated, such as <A+<B-C>*<D+E>>/F. 
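
The difference between the two evaluation orders is easy to demonstrate. The following Python sketch evaluates an expression strictly from left to right, with angle brackets for grouping, in the manner described for the OpenVMS assembler. It is a simplified model: the names tokenize and eval_left_to_right are hypothetical, only the four arithmetic operators are handled, and unary operators and 64-bit wraparound are left out.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/<>])")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}


def tokenize(text):
    """Break an expression into numbers, symbols, operators, and brackets."""
    text, pos, tokens = text.strip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def eval_left_to_right(text, symbols=None):
    """Evaluate with no operator precedence; <...> groups a subexpression."""
    tokens, symbols, pos = tokenize(text), symbols or {}, 0

    def term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "<":
            value = expression()
            if pos >= len(tokens) or tokens[pos] != ">":
                raise ValueError("missing closing angle bracket")
            pos += 1
            return value
        return int(token) if token.isdigit() else symbols[token]

    def expression():
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] in OPS:
            op = OPS[tokens[pos]]
            pos += 1
            value = op(value, term())    # apply each operator as it is met
        return value

    return expression()


names = {"A": 2, "B": 3, "C": 4}
print(eval_left_to_right("A+B*C", names))      # 20, i.e. <A+B>*C
print(eval_left_to_right("A+<B*C>", names))    # 14
print(2 + 3 * 4)                               # 14 under C-like precedence
```

With A, B, and C equal to 2, 3, and 4, the unbracketed expression gives 20 under left-to-right evaluation but 14 under rules where multiplication binds more tightly than addition.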

Expressions can be used anywhere within a program where values would be legal, 
as in these OpenVMS assembly language examples: 


.long 35*<7+6>
.blkb arraysize*4
length = rows*columns 


All symbols within an expression must be assembly-time constants or symbols that 
have already been defined. The evaluation of all terms and expressions for the Alpha 
occurs using 64-bit values. 
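
As a rough illustration of what 64-bit evaluation means for the unary operators of Table 3.5, the following Python sketch masks results to 64 bits. The helper names negate64 and complement64 are hypothetical, and the sketch models only the bit patterns, not how either assembler displays them.

```python
MASK64 = (1 << 64) - 1          # keep only the low 64 bits

def negate64(a):
    """Two's complement negation, returned as a 64-bit pattern."""
    return (-a) & MASK64

def complement64(a):
    """One's complement: every one of the 64 bits inverted."""
    return ~a & MASK64

print(hex(complement64(37)))    # 0xffffffffffffffda
print(hex(negate64(37)))        # 0xffffffffffffffdb
```

The two results differ by exactly one, which is the usual relationship between the one's complement and the two's complement of a value.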


Control Statements 


We are using the concept of control statements to include both system-supplied routines 
that affect the assembly process and certain assembler directives that affect the behavior 
of the assembler itself. This latter category includes the OpenVMS .title and .end 
directives, which have the forms: 
.title module name title phrase
.sbttl subsection phrase

{body of your program} 

.end label 


Although the .title is optional, it is considered good programming form to use it.
The specified module name plays a role with other system software: the linker, the 
debugger, and the librarian (we do not discuss the librarian in this book). The module 
name and the title phrase both will appear on the top line of each page of the assembly 
listing file. 

If a program is long enough to have subsections, there is also a .sbttl directive
whose subsection phrase will appear on the second line of each page of the assembly 
language listing. 

Although the Unix assembler does not use directives for titles or subtitles, com- 
monly accepted programming practice would call for a comment line at the top of a 
program module to identify it for readers. 

The .end directive informs the assembler program that this is the last physical 
line of text input. In the OpenVMS programming environment, the label is used to 
specify a transfer address at which execution is supposed to begin when the program is 
run. If an OpenVMS program is composed of several separately assembled or compiled 
pieces, only the “main program” should specify a transfer address. All other
components should end with a bare .end statement.

In the Unix programming environment, the program module named main consti- 
tutes the beginning point for execution. 

We intend to discuss Unix directives such as .ent and .prologue and OpenVMS
system-supplied routines such as $routine and $end_routine more fully
later.


Elements of a Listing File 


The listing file produced by an assembler typically reproduces the source file, line by 
line, and shows in columns at the left such features as a line number for reference, the 
address at which the data or instruction will be placed (i.e., the location counter value), 
and the representation for the data or instruction. The order and formatting of such ele- 
ments can differ markedly for different assemblers (see Figure 1.2). 

The OpenVMS assembler listing file for SQUARES2 is shown in Figure 3.3. 
(Some spaces and tabs have been removed to fit the page size of this book.) Note that 
the location counter (LC) and generated code are shown in hexadecimal representation. 
The line number column may not be very helpful because it also reflects lines from
system libraries that were considered during the processing of such macros as $routine,
$return, and $end_routine.


  Code    LC   Line  Source Program
SQUARES2 Table of Squares (OpenVMS)  29-DEC-1997 15:02:45  MACRO-64 V1.1-087  Page 1
                                     29-DEC-1997 15:02:37  DISK:[DIR]SQUARES2.M64;2

                  1          .title SQUARES2 Table of Squares (OpenVMS)
                  2          $routine squares2, data_section_pointer=true, -
                  3                   kind=stack, saved_regs=<r2,r15>
          0018 2130          $data_section
00000008  0000 2131 sq1:     .blkq 1          ; To store 1 squared
00000010  0008 2132 sq2:     .blkq 1          ; To store 2 squared
00000018  0010 2133 sq3:     .blkq 1          ; To store 3 squared
          0018 2134          ; etc.
          0018 2135          $code_section
00000008' 0018 2136          .base r27,$ls    ; R27 -> linkage section
A5FBFFF8  0018 2137          ldq r15,$dp      ; R15 -> data section
00000000' 001C 2138          .base r15,$ds    ; Tell MACRO to use this
47E03401  001C 2139 first::  mov 1,r1         ; R1 = first difference
47E05402  0020 2140          mov 2,r2         ; R2 = second difference
47E03400  0024 2141          mov 1,r0         ; R0 = first square
B40F0000  0028 2142          stq r0,sq1       ; to be stored
40410401  002C 2143          addq r2,r1,r1    ; Adjust first difference
40200400  0030 2144          addq r1,r0,r0    ; R0 = second square
B40F0008  0034 2145          stq r0,sq2       ; to be stored
40410401  0038 2146          addq r2,r1,r1    ; Adjust first difference
40200400  003C 2147          addq r1,r0,r0    ; R0 = third square
B40F0010  0040 2148          stq r0,sq3       ; to be stored
          0044 2149          ; etc.
47E03400  0044 2150 done::   mov 1,r0         ; Signal all is normal
          0048 2151          $return          ; Return to OpenVMS
          0060 2234          $end_routine squares2 ; Needed by $routine
          0060 2250          .end squares2    ; Set start address


Total lines assembled: 5306 


Command: MACRO/ALPHA/DEBUG/LIST SQUARES2 
Figure 3.3 SQUARES2 listing file (OpenVMS) 


Some assemblers also print a symbol table at the end of the listing file, giving the 
numeric value for each symbol introduced in the assembly language program. These 
may be marked with designations that show which ones are absolute (defined with a 
constant value), relocatable (subject to an additive constant at link time), or global 
(accessible to other routines, and shown in a link map). The OpenVMS assembler for 
Alpha does not produce a symbol table in its listing file. 

The Unix assembler for Alpha does not produce listing files, but the nm command
can print a table of symbols from an executable file.

The Assembly Process


Having established the basic syntax and semantics of assembly language, we will now 
give an overview of the process of assembling a program. Most assemblers are said to 
be two-pass assemblers; that is, they make two complete passes through the text file that 
expresses the source program. Two passes are required because all symbol values must 
be known before final translation into a binary machine language program can begin. 
As we have seen, the assembler is supposed to work through the programmer’s 
symbolic source program and produce the binary object program and perhaps a listing 
file. This task requires that the assembler program be able to perform three functions: 


1. remember the addresses and values of the symbols found in declarations and 
instructions; 

2. translate the symbolic operation names into binary opcodes; and 

3. convert the symbolic operand specifiers into their proper binary forms. 


Both passes through the source program are intimately connected with the concept of a 
symbol table that ultimately contains every programmer-defined symbol. An assembler 
also contains internal tables of permanently known symbols, such as the names for reg- 
isters, the assembler directives, and the mnemonic opcodes. 

On the first pass, every new symbol is entered into the symbol table. As values for 
symbols become known while the assembler is passing down through the source code, 
those values are entered into the symbol table. The assembler remains alert to the possi- 
bility that the program may contain the flaw of a multiply-defined symbol. By the end 
of the first pass, every symbol should either have a value or be known to be an external 
symbol (sometimes also called a global symbol). Some assemblers report a list of unde- 
fined symbols at the end of the first pass, while others assume that such symbols are 
intentional (i.e., not the result of a typographical error) and therefore external. 

External symbols are resolved during the linking process (described later). In 
order to enforce its syntax and semantics, an assembler also has to classify the program- 
mer’s symbols by type. At the very least, this means noting whether they were defined 
using an equal sign (symbolic value) or a colon (symbolic address label). 

At the end of the first pass for SQUARES2, the internal symbol table prepared by
the assembler would contain the labels sq1, sq2, sq3, first, and done. Their values
are the location counter values that you can see in the listing file (Figure 3.3). In this
particular example, there are no forward references, i.e., no occurrences where a symbol
is used before it has been defined explicitly or implicitly. SQUARES2 could be successfully
assembled by a one-pass assembler, but this is merely fortuitous. The values of the
symbols first and done are location counter offsets from the origin of the code section,
while sq1, sq2, and sq3 have values relative to the origin of the data section.

The OpenVMS $routine directive defines these sections as well as a linkage
section that contains the module name squares2 and the system-supplied symbols
$ls and $dp. The OpenVMS assembler may also define other symbols from the
$routine macro, such as the symbol $ds, which has a value relative to the origin of
the data section.

On the second pass, the object and listing files are produced. Expressions are eval- 
uated using values from the symbol table, and the resulting data values or instructions 
are formatted as the appropriate binary numbers and designated for the correct storage 
addresses. Thus as each line of the source program is interpreted, these output files 
grow incrementally. Diagnostic error messages are also written into the listing file, and 
severe errors may abort further generation of the object file. A comment about dealing 
with errors: Always give prime consideration to the first few error messages; fixing the 
problem that gave rise to those may clear up many subsequent messages too. 

Expressions containing certain combinations of symbols, including external refer- 
ences, cannot be fully resolved by an assembler. In such instances, information about 
the expressions is forwarded to the linker by way of the object file. 
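
The division of labor between the two passes can be sketched in Python. This is a toy model under stated assumptions, not the algorithm of MACRO-64 or the Unix assembler: each source line is a hypothetical (label, operation, operand) triple, every line is assumed to occupy 4 bytes, and undefined symbols are simply assumed to be external.

```python
def first_pass(lines, size=4):
    """Pass 1: record the location counter value of every label."""
    symbols, lc = {}, 0
    for label, _operation, _operand in lines:
        if label is not None:
            if label in symbols:
                raise ValueError(f"multiply-defined symbol: {label}")
            symbols[label] = lc
        lc += size                      # advance the location counter
    return symbols

def second_pass(lines, symbols, size=4):
    """Pass 2: replace each symbolic operand by its value."""
    output, lc = [], 0
    for _label, operation, operand in lines:
        if operand is None:
            value = None
        elif operand in symbols:
            value = symbols[operand]
        else:
            value = f"external:{operand}"   # left for the linker to resolve
        output.append((lc, operation, value))
        lc += size
    return output

program = [
    ("start", "br",  "done"),     # forward reference: done is defined below
    (None,    "nop", None),
    (None,    "jsr", "printf"),   # never defined here, so treated as external
    ("done",  "ret", None),
]

table = first_pass(program)
print(table)                        # {'start': 0, 'done': 12}
for entry in second_pass(program, table):
    print(entry)
```

The forward reference to done on the first line is the reason one pass is not enough in general: its value (12 in this toy layout) is unknown until the whole source has been scanned once.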


The Linking Process 


In a programming environment such as OpenVMS for the Alpha, the assembler or high- 
level language compilers never produce a program that is complete in itself and ready to 
run. Instead, a linker program is provided as part of the system software. The principal 
tasks of the linker require that it be able to perform three functions: 


1. build a table of all external symbol names found in all object files; 

2. gather values for every symbol, including references to system-supplied routines 
and special values; and 

3. adjust certain temporary, relative addresses to final specific memory locations. 


Like most assemblers and compilers, a linker may also require at least two passes 
through all the object files. 

On the first pass, symbols are gathered into a composite symbol table for the 
entire program. If any external symbols still lack values at the end of this first pass, one 
or more system libraries are consulted. The production of an executable program may 
be aborted in the event that any symbols remain unresolved. 
A linker also tabulates the total lengths of the various code and data sections, i.e.,
the maximum value that the location counter for each section had reached. After the 
first pass, the linker arranges the code and data sections end to end into a linear 
sequence, perhaps alphabetically or perhaps in an order requested by the programmer. 
Then final values for symbols can be established. 

For the second pass, space for the executable image file is allocated in the pro- 
grammer’s data storage area on disk. This file is then treated as a giant array of memory 
units. As the linker rereads all the object files, it distributes each item encountered into 
a storage unit at the correct address. Some linkers offer the potential to initialize data 
elements, e.g., to a value such as zero, but cautious programmers never depend on such 
behavior. 

If the programmer requested a map file, that is produced also. The map file con- 
veys a human-readable tabulation of what is stored where. Some linkers offer other fea- 
tures such as choices of sorting order for symbols, e.g., by name or by value. Figure 3.4 
is adapted from portions of an OpenVMS link map for SQUARES2. 

Notice that the OpenVMS linker has positioned the linkage section at virtual 
address 10000, the code section at address 20000, and the data section at address 
30000. This illustration shows why we had to inspect addresses starting at 30000 
using the EXAMINE command to find the results computed by the SQUARES pro- 
gram in Chapter 1. 
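
The address arithmetic behind such a link map can be sketched in Python. The section lengths and symbol offsets below are taken from the SQUARES2 listing and link map; the function names layout and relocate, the section names, and the rule of starting each section on a multiple of hexadecimal 10000 are assumptions chosen only so that the sketch reproduces the addresses quoted above, not a description of the OpenVMS linker's actual policy.

```python
def layout(sections, start=0x10000, granule=0x10000):
    """Place sections end to end, each starting on a granule boundary."""
    bases, address = {}, start
    for name, length in sections:
        bases[name] = address
        # Round the next free address up to a multiple of granule.
        address = -(-(address + length) // granule) * granule
    return bases

def relocate(symbols, bases):
    """Turn (section, offset) pairs into final virtual addresses."""
    return {name: bases[section] + offset
            for name, (section, offset) in symbols.items()}

sections = [("linkage", 0x28), ("code", 0x60), ("data", 0x18)]
symbols = {
    "SQUARES2": ("linkage", 0x08),
    "FIRST":    ("code", 0x1C),
    "DONE":     ("code", 0x44),
    "SQ1":      ("data", 0x00),
    "SQ2":      ("data", 0x08),
    "SQ3":      ("data", 0x10),
}

bases = layout(sections)
for name, address in sorted(relocate(symbols, bases).items()):
    print(f"{name:<10}{address:08X}")
```

Each symbol's final value is simply its section base plus the location counter offset that the assembler recorded, which is why such symbols are marked relocatable.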

The Unix linker implements an entirely different convention for assigning final 
virtual addresses, putting the text section of SQUARES2 at address 0x1200010d0 and 
the data section at 0x140000000. Although the Unix linker does not produce link maps 
itself, memory allocation information in the Unix programming environment can be 
ascertained using a command of the form nm -x squares2 to show some of the
symbolic reference information stored as part of the executable file (recall the -g flag 
used on the cc command). 
29-DEC-1997 15:02                                   Linker A11-39

Module Name   Bytes   File             Creation Date
SQUARES2        160   SQUARES2.OBJ;1   29-DEC-1997 15:02

Psect Name   Module Name   Base       End        Length
$LINK$                     00010000   00010027   00000028 (  40.)
             SQUARES2      00010000   00010027   00000028 (  40.)
$CODE$                     00020000   0002005F   00000060 (  96.)
             SQUARES2      00020000   0002005F   00000060 (  96.)
$DATA$                     00030000   00030017   00000018 (  24.)
             SQUARES2      00030000   00030017   00000018 (  24.)

Symbol       Value
DONE         00020044-R
FIRST        0002001C-R
SQ1          00030000-R
SQ2          00030008-R
SQ3          00030010-R
SQUARES2     00010008-R

Key for special characters
   *  - Undefined
   A  - Alias Name
   I  - Internal Name
   U  - Universal
   R  - Relocatable
   X  - External
   WK - Weak

LINK/DEBUG/MAP SQUARES2

Figure 3.4 Information from SQUARES2 link map (OpenVMS)
The Program Debugger


Debugging programs written at the very low level of an assembly language presents a 
greater challenge to the programmer than would be the case for writing in high-level 
languages, in part because of the greater effort required to insert temporary diagnostic 
print statements here and there. Recognizing this difficulty, Digital Equipment Corpora- 
tion and other system manufacturers typically distribute various debugging aids as part 
of the system software. 

Several debugging aids are available for the Digital Unix programming environment,
but we use the simplest, the dbx debugger, which has capabilities paralleling those
of the OpenVMS symbolic debugger, including the capability to debug programs writ- 
ten in high-level languages as well as assembly language. The debuggers furnished for 
the Unix and OpenVMS programming environments are especially powerful tools, 
described at considerable length in the manuals cited as references for this chapter. We 
present only a small subset of the most useful commands and techniques that we feel 
are especially appropriate for learners. 

Preventing bugs and logic flaws completely, through careful design and reasoning 
about the problem to be solved, is a laudable goal for all programmers. Nevertheless, 
few people develop final working programs without passing through an awkward 
debugging phase to track down and fix inevitable errors. Learning to use at least the 
simplest capabilities of a debugger can greatly increase a person’s productivity. 


The Unix and OpenVMS debuggers provide interactive environments that will 
allow you to experiment with your programs as you develop them. You may even be 
able to trace the course of a routine using modified input data without going back 
through the entire edit-assemble-link-test cycle of program modification for small or 
conjectural changes. Furthermore, we will illustrate using the debugger as an easy tech- 
nique for obtaining simple formatted output from the brief didactic programs in the next 
few chapters. 


Capabilities of Debugger Programs 


A debugger allows you to execute your program interactively through the control
capabilities of the debugger program itself. A typical debugger includes facilities such as
these:


• State examination and modification. You can examine variables and change their
values at will. This can save development time, since you can try a simple
modification manually in the middle of program execution and then allow the program to
continue. Sometimes you can efficiently try two or three small changes, and then
make only one edit to the source.

• Execution control. You can restrain your program from free-wheeling execution
by stepping along, one or a few statements at a time. You can modify the flow of
execution by skipping instructions, by branching to a different point in the
program, or by calling procedures.

• Breakpoints, watchpoints, and tracepoints. You can define a breakpoint at a
specific statement where execution should be paused in order to allow you to examine
registers or memory locations. You can define a watchpoint on a variable in order
to pause automatically whenever that variable is about to be modified at some
unknown place in your program. You can define a tracepoint for a statement or
event for which you would like to have some diagnostic message printed on every
encounter.


The debuggers for the Alpha provide support for such capabilities. Some of the specific 
debugger commands are listed in Table 3.6. These debuggers offer considerable built-in 
assistance through an interactive help command. 


Table 3.6  Selected Commands for the Alpha Debuggers

dbx (Unix)            OpenVMS debugger         Action taken

Searching the source program
/TEXT                 search 1 TEXT            Searches for text string in source file
/                     search                   Searches for next occurrence of same

Examining and modifying memory locations
ADDR/X                examine ADDR             Displays a quadword value (hexadecimal)
ADDR/D                examine/dec ADDR         Displays a quadword value (decimal)
ADDR/xx               examine/long ADDR        Displays a longword value (hexadecimal)
ADDR/dd               examine/long/dec ADDR    Displays a longword value (decimal)
ADDR/i                examine/instr ADDR       Interprets as a machine instruction
ADDR/nc               examine/ascii:n ADDR     Shows as a string of n characters
ADDR/s                examine/asciz ADDR       Shows a null-terminated string
assign ADDR=EXP       deposit ADDR=EXP         Changes a stored quadword value

Examining and modifying variables and registers
px EXP,...            examine EXP,...          Prints evaluated expression (hexadecimal)
pd EXP,...            examine/dec EXP,...      Prints evaluated expression (decimal)
assign VAR=EXP        deposit VAR=EXP          Changes a register or memory variable

Flow control
run                   go                       Starts execution of the program
stepi n               step/instr n             Executes only n machine instructions
conti                 go                       Resumes execution to next breakpoint
quit                  quit                     Exits from the debugger

Breakpoints, watchpoints, and tracepoints
stop at LINE                                   Puts a breakpoint at the chosen line
                      set break LABEL          Puts a breakpoint at the labeled line
stopi VAR             set watch VAR            Sets a watchpoint for modification of the variable
tracei at LINE        set trace line LINE      Sets a tracepoint for the chosen line
delete all            cancel all               Removes all breakpoints, watchpoints, and tracepoints

Conditional action
when at LINE {...}    set break ... do (...)   Performs commands in {...} or (...) whenever breakpoint is reached
when VAR {...}        set watch VAR do (...)   Performs commands in {...} or (...) whenever variable changes
                      set trace ... do (...)   Performs commands in {...} or (...) whenever tracepoint is encountered



The debugger may use different naming conventions for the hardware registers R0
... R31 and F0 ... F31. The dbx debugger expects the names $r0 ... $r31 (in contrast
to $0 ... $31 in the assembly language source file) and $f0 ... $f31. The facilities of
the Unix programming environment are thus not as uniformly consistent in register
nomenclature as those of the OpenVMS programming environment, where the assembler
and the debugger both accept the hardware register names (in lower case or upper case).
The next two sections illustrate a few of the debugger commands for the Unix and
OpenVMS programming environments. Study most carefully the one corresponding
to the system type to which you have access or with which you have experience.


Running SQUARES2 using dbx (Unix) 


A debugger attains its fullest capabilities when it has access to all the symbolic names 
that you have used in writing your program. Executable files ordinarily omit this 
detailed information unless you use the -g flag when assembling and linking by way of 
the cc command. (The dbx debugger for the Unix programming environment does
also have a limited capability to control a program that was not linked with symbols
included.)

A successful assembling, linking, and debugging session for the SQUARES2
program would proceed as follows (spaces and tabs have been adjusted for a clearer
presentation):


> cc -g -o squares2 squares2.s
> dbx -i squares2 

dbx version 3.11.10 

Type 'help' for help. 


main:  17  first:  mov  1,$1      # R1 = first difference
(dbx) /done
   29  done:   mov  $31,$0    # Signal all is normal
(dbx) stop at 29
[2] stop at "squares2.s":29
(dbx) run
[2] stopped at [main:29 ,0x1200011cc]  done:  mov  $31,$0  # Signal all is normal
(dbx) pd sq1,sq2,sq3
1 4 9
(dbx) quit
>


The -i flag on the dbx command line invokes the debugger in interactive mode. The
executable file squares2 and the source file squares2.s both need to be in the
current working directory and, of course, must correspond to the same version of the
program.

A breakpoint is set at the line marked with the label done. The program is then
allowed to proceed from the top down to the label done. The printed answers are
clearly correct for the algorithm that computes a table of squares of the first few
integers.


Running SQUARES2 using the OpenVMS Debugger 


A debugger attains its fullest capabilities when it has access to all the symbolic names 
that you have used in writing your program. Executable files ordinarily omit this 
detailed information unless you use the /debug qualifier when assembling and again 
when linking. (The OpenVMS debugger does have a limited capability to control a pro- 
gram that was not linked with symbols included; the command is run/debug PROG.) 

A successful assembling, linking, and debugging session for the SQUARES2 pro- 
gram would proceed as follows (spaces and tabs have been adjusted for a clearer pre- 
sentation): 


$ macro/alpha/debug/list squares2        (/list is optional)
$ link/debug/map squares2                (/map is optional)
$ run squares2

OpenVMS Alpha DEBUG Version V7.1-03R

%DEBUG-I-INITIAL, Language: MACRO64, Module: SQUARES2

DBG> set break done
DBG> go
break at SQUARES2\%LINE 2150
  2150: done::  mov 1,r0        ; Signal all is normal
DBG> examine/decimal sq1:sq3
SQUARES2\SQ1:   1
SQUARES2\SQ2:   4
SQUARES2\SQ3:   9
DBG> quit
$


The executable file squares2.exe and the source file squares2.m64 both need
to be in the current working directory and, of course, must correspond to the same
version of the program.

A breakpoint is set at the label done. The program is then allowed to proceed
from the top down to the label done. The printed answers are clearly correct for the
algorithm that computes a table of squares of the first few integers. Notice the use of a
colon to designate a range of successive locations with the examine command.
Examples of Debugger Commands


We present here some further illustrations of debugger commands that may be useful to
you in your own debugging efforts. These scarcely scratch the surface of what is
possible to accomplish using a debugger. Use the built-in help command or consult the
vendor manuals for additional information.


Tracepoint example 


We could have used a tracepoint to follow the progress of SQUARES2 by noting that
register R0 successively takes on the values of the computed squares. Here are suitable
command sequences for the Unix programming environment:


> dbx -i squares2 
(dbx) tracei $r0 
(dbx) run 


and for the OpenVMS programming environment: 


$ run squares2
DBG> set trace r0 
DBG> go 


The debugger will step through the program and print out the old and new values in
register R0 at each of the program lines where R0 is altered.
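
What such a trace reports can be imitated with a small Python model of the SQUARES2 computation. This is only a sketch: the class TracedRegisters and the function squares are hypothetical names invented for the illustration, the register file is a plain dictionary, and the old value shown for the first change is None because the model has no prior register contents (a real register would hold whatever was left in it).

```python
class TracedRegisters(dict):
    """Register file that logs every change to one traced register."""

    def __init__(self, traced):
        super().__init__()
        self.traced = traced
        self.log = []                   # list of (old value, new value)

    def __setitem__(self, name, value):
        if name == self.traced:
            self.log.append((self.get(name), value))
        super().__setitem__(name, value)

def squares(regs, count):
    """Build a table of squares by the method of successive differences."""
    regs["r1"] = 1                      # first difference
    regs["r2"] = 2                      # second difference
    regs["r0"] = 1                      # first square
    table = [regs["r0"]]
    for _ in range(count - 1):
        regs["r1"] = regs["r2"] + regs["r1"]    # adjust first difference
        regs["r0"] = regs["r1"] + regs["r0"]    # next square
        table.append(regs["r0"])
    return table

regs = TracedRegisters("r0")
print(squares(regs, 3))     # [1, 4, 9]
print(regs.log)             # [(None, 1), (1, 4), (4, 9)]
```

Each logged pair corresponds to one instruction that writes the traced register, which is the information a tracepoint on that register is meant to surface.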


Watchpoint example 


We could have used a watchpoint to determine when the execution of SQUARES2 has
stored a value for sq3. Here are suitable command sequences for the Unix
programming environment:


> dbx -i squares2 
(dbx) stopi sq3 
(dbx) run 


and for the OpenVMS programming environment: 


$ run squares2
DBG> set watch sq3 
DBG> go 
The debugger will execute the program instructions stepwise and pause when the
specified variable sq3 has been modified. Although seemingly trivial here, this capability of
watchpoints can be invaluable with a bigger program when you may have developed a 
psychological “blind spot” and cannot see how or where a quantity is being modified, 
perhaps erroneously and unintentionally. 


Using the screen mode (OpenVMS) 


The OpenVMS debugger has a very versatile full-screen mode that works with VT ter- 
minals or emulations like Telnet. To begin full-screen mode, give the debugger com- 
mand set mode screen or press the PF3 key above the numeric keypad. Then it 
may be useful to establish another window that displays critical register contents. For 
example, the command 


display reg at rh1 do (examine rX:rY)


produces a window region named reg with a size equal to the “first half” (i.e., top half) 
of the right half of the screen. Every time a step is executed or a pause occurs, the do 
clause displays the values in registers rX through rY. The show window/all com- 
mand will list many more of the possible screen-segmenting terms like rh1. 


Use of the several other debugging utilities for the Unix programming environ- 
ment lies beyond the scope of this book. 


Comparing Variants of a Source File 


As we have repeatedly implied, assembly language is not very “friendly” in the sense
that the programmer must attend to very small details that compilers for high-level
languages handle very nicely behind the scenes. For most of us, spotting tiny differences
among already small details is rather difficult for our eye-brain coordination to do well.


Many programming environments provide some sort of standard utility program 
for comparing two text files that may have some differences between them, but lots of 
similarities overall. The utility program reads through both versions and shows us those 
lines where differences occur and the contexts in which whole lines may have been 
inserted or deleted. Here are the simplest forms of commands for such comparisons for 
the Alpha programming environments: 


> diff file1 file2              (Unix)
$ differences file1 file2       (OpenVMS)
In both cases, there are numerous options and variations of the basic commands shown
here, and you may wish to consult system documentation or use on-line help (man
diff or help differences).

It is not at all uncommon to get a program working nicely and then make “just a 
couple of tiny changes” only to find that it then performs badly. Rather than scan 
through manually, and perhaps forget that you made five “tiny changes” instead of two, 
we think you will find these commands well worth remembering and using. 
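
The kind of report such a utility produces can be imitated with Python's standard difflib module. This is merely an illustration of the idea, not the output format of either system's command; the function name changed_lines, the two file names, and the sample source lines are hypothetical.

```python
import difflib

def changed_lines(old, new):
    """Return a unified-diff style report for two lists of source lines."""
    return list(difflib.unified_diff(
        old, new, "squares2.s.orig", "squares2.s", lineterm=""))

old = ["first:  mov  1,$1",
       "        mov  2,$2",
       "        mov  1,$0"]
new = ["first:  mov  1,$1",
       "        mov  3,$2",     # one "tiny change" to the second difference
       "        mov  1,$0"]

for line in changed_lines(old, new):
    print(line)
```

Lines that begin with a minus sign appear only in the older version and lines that begin with a plus sign appear only in the newer one, so a single altered operand stands out immediately.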


Conventions for Writing Programs 


Individuals differ in their styles of programming. A particular individual may even use 
somewhat different styles when programming in different computer languages. The 
characteristics of programming style include: 


• the way you structure or subdivide a program;

• the algorithms you use to solve common problems;

• the instructions or language constructs you choose, from among all possibilities;

• the format and overall appearance of your source program; and

• the type of commenting that you incorporate in a program.


To be sure, bookshops and libraries contain numerous computer science books on pro- 
gramming style; those can teach you how to structure or format programs, as well as 
what language constructs to use or to avoid. While not repeating such readily available 
information, we do wish to make a few general comments about assembly language 
programming specifically. 

As we have implied previously, few programs are written entirely in assembly lan- 
guage any longer. The exceptions are some critical systems programs and—very impor- 
tantly, we think—those written for the purpose of learning as you are doing. We 
strongly urge attention to the following goals: 


1. The program must solve the problem at hand. 

2. The program must convey both the technique of solution and the details of imple- 
mentation to the reader in a clear fashion. 

3. The program must be easy to understand and to modify. 


These goals require that the program contain good internal documentation. Choose 
plain methods for your algorithms and instruction sequences, not arcane tricks. 
Debugging will be facilitated by using quadwords, or sometimes longwords, to
contain most integer quantities. Although conciseness is admirable, computer memory 
is cheaper than human labor. Packing data into the smallest possible information units 
could require unreasonably long instruction sequences for unpacking and manipulation. 
Byte- and word-length information units are not supported naturally by most basic 
Alpha instructions. 


Both in a learning context and in real-world program development, the comments 
become nearly as much a part of the program as the instructions. While comments are 
not executable, they are essential for a program to have a long useful life because it can 
then be readily maintained by persons other than the original programmer. In assembly 
language, virtually every line should have a comment; moreover, those comments must 
be meaningful. A comment that merely echoes the opcode and specifiers has no value 
because the reader already knows that information. A comment adds value when it 
explains why a particular instruction is used in a particular place in a sequence. Often 
the comments on adjacent lines should work together to tell a story. Blocks of comment 
lines with no instructions can introduce a routine or separate a large effort into smaller 
units for better comprehension by the reader or maintenance programmer. 


All of the sample programs in this book will illustrate a style where comments are 
quite concise indeed but nonetheless convey as clearly as possible what is going on. 
This “less is more” approach allows printing the programs in portrait instead of land- 
scape orientation on the pages of this book and also permits displaying the programs in 
a classroom setting from a projected computer display or transparency. (We learned this 
technique from the uncharacteristic and utterly mythical law firm of Pithy, Trenchant, 
and Germane.) 


Summary 


The development cycle for an assembly language program involves the use of numerous 
system utility programs. Typically these include a text editor, assembler, and linker to 
produce an executable program file. We have discussed the symbolic assembler in the 
most detail, because it mediates between the human programmer and the machine lan- 
guage for a computer architecture. When you run a newly constructed program to test it, 
the results may not be as expected. A symbolic debugger can help speed the rectifica- 
tion of oversights by assisting you in stepping through the program and showing the 
values of intermediate quantities. Both the Unix and OpenVMS system software have 
been illustrated for the Alpha architecture with closely similar versions of the program 
for computing a table of squares. 
EXERCISES


3.1 Give an example of a simple statement in a high-level language and its assembly lan- 
guage equivalent. 


3.2 Why are special symbols such as “:” and “; ” or “#” used in writing an assembly 
language statement? 


3.3 Explain all the changes that would be involved in rewriting SQUARES2 to use long- 
word instead of quadword storage. 


3.4 (OpenVMS only) Are there any printing ASCII characters that would not work as
matching delimiters with the ^a unary operator? (You can usually test your thinking
on questions like this by direct experimentation: write a tiny TEST.m64 file and see
how MACRO-64 reacts to it.)


3.5 What are the advantages of equating a value to a symbol that can be used throughout 
a program? 


3.6 Write the syntax for specifying the value forty-two using several radix values for (a) 
the Unix assembler or (b) the OpenVMS assembler. 


3.7 What hexadecimal numbers are stored in memory for the following statement? 
a. .byte 156, 128, 3, 0x29, 0123, 3+5/4 (Unix)
b. .byte 156, ^d128, 3, ^x29, ^o123, 3+5/4 (OpenVMS)


3.8 If the symbols PAR1 and PAR2 have been equated to 7 and 9, respectively, how will
the expression 8*PAR1-PAR2/3 be evaluated by the assembler in (a) the Unix
programming environment and (b) the OpenVMS programming environment?


3.9 Show what will be stored where in memory if the following lines are assembled in
the order shown. Assume that here is at location 30000 (hexadecimal). List all symbols
defined by these directives, their values, and whether they are absolute or relative.

here:   .quad 102
there = 104
this:   .quad 12, 34
thing = 200


3.10 Determine whether any permutations of 2 or 3 unary arithmetic and logical operators
out of (a) the set {-, ~, 0x} for Unix or (b) the set {-, ^c, ^x} for OpenVMS
applied to the number 10 are valid. For example, is -^x^c10 acceptable to
MACRO-64? Do parentheses (Unix) or angle brackets (OpenVMS) help out at all?
Summarize your findings, including a statement about precedence. Hint: Use the
.quad directive.


3.11 Why do assemblers typically make two straight-through passes, instead of moving 
back and forth within the text of an assembly language source program in order to 
resolve the values of symbols that are referenced at an earlier point than where they 
are actually defined? 


3.12 Explain why a linker program must usually make two passes through all of the object 
files that are brought together into one executable program. 


3.13 Adapt the program SQUARES2 to compute the cubes of the first five integers without
using any explicit multiplication. Hint: An algorithm for N^3 can be discovered by
writing down the series 1, 8, 27, 64, ... and then inspecting the pattern of first,
second, and third tabular differences. (For OpenVMS, change the saved_regs specification
to reflect the number of registers that you need for intermediate results. Store
only the function values in memory locations.)


3.14 Extend the SQUARES2 program to compute a list of values of one of the following 
polynomials for integer values of N from 1 through 5, again without using explicit 
multiplication instructions. 


a. N^2 + N
b. 2N^2 - 1
c. N^2 - N + 2


The Alpha instruction subq Rx, Ry, Rz subtracts the contents of Ry from the con- 
tents of register Rx and places the 64-bit difference in Rz. (For OpenVMS, change 
the saved_regs specification to reflect the number of registers that you need for 
intermediate results. Store only the function values in memory locations.) 





CHAPTER 4 


Alpha Instruction Formats and Addressing


The designers of any new computer architecture face 
many decisions. Their general approaches and their choices of details not only affect 
the potential success of the first implementation brought to market but may also predestine
the evolution of all future models. The successes and shortcomings of earlier architectures
from all manufacturers provide a general backdrop that design teams dare not
forget, because numerous influential critics (including the financial analysis commu- 
nity) will inevitably draw such comparisons. More interestingly, fundamental research 
on new architectural principles may provide novel ideas that must be evaluated because 
developments in technology should be anticipated to become available for new imple- 
mentations over the course of an architecture’s life span. 


How do designers decide upon the width (in bits) for instructions? Will all instruc- 
tions have the same width (which leads to simplicities of design), or will several classes 
of instructions with different widths be included in the architecture (which may drive up 
the costs of implementation)? Will programmers and compiler writers actually use all 
of the instruction types that an architecture supports? Are there any instruction types 
that are perceived to be truly essential, or others that are only marginally useful? What- 
ever the answers selected, one mathematical truth always rules: with w bits, there are 
only 2^w unique bit patterns for all imagined uses in the encoding of instructions.


In the next few paragraphs we return to the superfamily concept from Chapter 1 as 
applied to the evolution of various architectures put forth by Digital Equipment Corpo- 
ration, as but one family history that could be told. 
The designers of the PDP-11, who were striving to design a minicomputer architecture
whose implementations would deliver a substantial fraction of then-contemporary
mainframe performance at a minor fraction of mainframe cost, decided on a mixed
strategy: the instruction proper (opcode and all register references) would occupy 16 
bits, but certain addressing modes would involve either one or two additional 16-bit 
words used as address offsets. Those choices resulted in 7 opcodes for two-address 
instructions, a much larger number of one-operand instructions, and ample provision 
for zero-operand instructions. 


The designers of the VAX worked at a time before it could be safely anticipated 
that semiconductor memory would become almost indefinitely scalable through greater 
capacity at ever-declining costs. As they hoped to design for high performance at a 
moderate price, the still-high cost of memory at a crucial early time in the design cycle 
led to decisions strongly biased toward compact instructions and efficient data storage 
using numerous sizes of information units (bytes, words, longwords, etc.). VAX instruc- 
tions are variable-length, averaging about 4 bytes. Thus only slightly more memory is 
required to express a VAX program than an equivalent PDP-11 program. Any additional 
memory can be allocated to data, not to the program. Because one byte is used for the 
opcode, with all register references appearing in subsequent bytes, the VAX architec- 
ture provides a great many more multiple-operand instructions than its 16-bit predeces- 
sor. Some of those are so complex and specialized that they never found their way into 
much use, even by the maker’s own compiler writers. 


Much had changed by the time that Digital Equipment Corporation put together a 
design team for what has become the Alpha architecture. The cost of memory no longer 
predominated over other implementation costs. The research community had seriously 
challenged the hegemony of Complex Instruction Set Computers (CISC), exemplified 
by the VAX and the Intel 80x86 series. Other manufacturers like MIPS® and Sun® were 
having success, especially for high-performance scientific workstations, with Reduced 
Instruction Set Computers (RISC) using 32-bit datapaths and uniformly sized 32-bit 
instructions. Those 32-bit RISC architectures had evolved from advanced research on 
RISC principles. 


The Alpha was the first commercially successful architecture with 64-bit data- 
paths (other major manufacturers followed a few years later). But a RISC architecture 
does not necessarily need 64 bits for instruction encoding, since the RISC idea strongly 
advocates that there be fewer opcodes than for the VAX and dramatically fewer 
addressing modes. Therefore the Alpha uses only 32 bits to express each instruction. 


In this chapter we outline the structure of Alpha instruction types and the ways to 
access stored data. We also begin to discuss the design, operation, and applications of 
Alpha instructions one group at a time. Later chapters take up additional groups of
instructions.


Overview of Alpha Instruction Formats 


The formats of Alpha instructions can be classified according to the number of register 
operand fields, which ranges from three to zero (Figure 4.1). The opcode field is uni- 
formly 6 bits wide (bits 31:26) for each of these classes. An opcode value always 
implies the particular class to which the instruction belongs. The determination of 
instruction class, in turn, governs how the remaining bits are to be interpreted. 


• Operate instructions support integer arithmetic, integer logical and shift operations,
  and floating-point arithmetic. Generally, registers Ra and Rb contain source
  data or parameters, and register Rc is the destination for the result.

• Load and store instructions transfer longwords or quadwords between register Ra
  and memory, using Rb and the displacement for specifying the address of the
  information unit in memory.

• Branch instructions conditionally alter the program flow based on a test of the
  current contents of register Ra; the displacement adjusts the program counter if
  the branch is to be taken.

• PALcode instructions are essentially “system calls” leading to sequences of
  instructions in firmware or auxiliary software libraries; a few of these are operating
  system independent, while others are operating system specific. PALcode
  allows the basic Alpha architecture to remain simple, while providing special features
  needed by different operating systems (Unix, OpenVMS, Windows NT).


Although the Alpha architecture has only four principal classes of instructions, a few 
special cases based on these patterns provide additional capabilities. 
Class                 Format (bits 31:26 | 25:21 | 20:16 | 15:5 | 4:0)    Registers   Opcodes*
Operate               Opcode | Ra | Rb | Function | Rc                     3           7
Reserved (PALcode)                                                                     5
Reserved (Digital)                                                                     13


* Distribution of the usage of opcodes for the first Alpha implementation (21064 chip). 


Figure 4.1 Major classes of Alpha instructions 


The 6-bit Alpha opcode values range from 00 to 3F (hexadecimal). Since these 6 
bits are at positions 31:26, the values will actually appear to range from 00 to FF in the 
instruction field of an assembly listing (depending also on the two most significant bits 
of the Ra field): 


Appearance         Binary       6 bits     Actual Opcode
00 through 03      0000 00xx    00 0000    00
04 through 07      0000 01xx    00 0001    01
08 through 0B      0000 10xx    00 0010    02
0C through 0F      0000 11xx    00 0011    03
F0 through F3      1111 00xx    11 1100    3C
F4 through F7      1111 01xx    11 1101    3D
F8 through FB      1111 10xx    11 1110    3E
FC through FF      1111 11xx    11 1111    3F


That is, you can take the two leftmost hexadecimal digits of the instruction (its leftmost
byte) expressed as a binary number, discard two bits at the right, and regroup the
remaining six bits in order to obtain the actual opcode value. (We give an example in
the next section.)
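
The regrouping just described amounts to a right shift by two bits. The following Python sketch is only an illustration; the function names are our own and do not belong to any assembler or debugger.

```python
def opcode_from_leading_byte(byte_value: int) -> int:
    """Drop the two low bits of the leading byte; they belong to the Ra field."""
    return (byte_value & 0xFF) >> 2


def opcode_from_word(instruction: int) -> int:
    """Extract bits <31:26> directly from a 32-bit instruction word."""
    return (instruction >> 26) & 0x3F


# Reproduce a few rows of the table: listing appearance -> actual opcode.
for shown in (0x00, 0x07, 0x0C, 0xF3, 0xFF):
    print(f"{shown:02X} -> {opcode_from_leading_byte(shown):02X}")
```

Either function gives the same answer for a complete instruction word, since the leading byte is simply bits <31:24>.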


Computer designers seldom assign opcode values randomly. Particularly for RISC 
designs, thoughtful regularity in assignment of opcode values makes possible certain 
efficiencies in the layout of the circuitry on the chip responsible for executing each 
class of instructions and, indeed, each individual instruction. For example, the Alpha 
opcodes within pairs of complementary branch instructions differ by just one bit:
equal/not equal, less than/greater than or equal, greater than/less than or equal.
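
The one-bit relationship can be checked mechanically. This chapter does not list the branch opcodes, so the hexadecimal values in the Python sketch below are an assumption taken from the published Alpha opcode summary; verify them against that reference before relying on them.

```python
# Assumed conditional-branch opcodes (hex), not taken from this chapter.
BRANCH_OPCODES = {
    "beq": 0x39, "bne": 0x3D,   # equal / not equal
    "blt": 0x3A, "bge": 0x3E,   # less than / greater than or equal
    "bgt": 0x3F, "ble": 0x3B,   # greater than / less than or equal
}

COMPLEMENTARY_PAIRS = [("beq", "bne"), ("blt", "bge"), ("bgt", "ble")]


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which two opcodes differ."""
    return bin(a ^ b).count("1")


for first, second in COMPLEMENTARY_PAIRS:
    count = differing_bits(BRANCH_OPCODES[first], BRANCH_OPCODES[second])
    print(first, second, count)
```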

Notice that over 25% of the possible opcode values were “reserved” (Figure 4.1) 
and had no defined purpose in early implementations of the Alpha architecture. A pre- 
cedent occurred over the lifetime of the PDP-11 architecture. An original model lacked 
direct hardware support for floating-point operations, which were added later using pre- 
viously reserved opcode values. Ultimately a few models also had a “commercial 
instruction set” especially designed to support constructs in the COBOL programming 
language. Typically, any such evolutionary extension of an architecture will be designed 
as a strict superset of the fundamental architecture. Thus programs assembled or compiled
for the earliest models will run on later models, but the converse is not true unless
software or firmware emulation is added. Attempting to execute a “reserved” instruc- 
tion generally produces a hardware fault that will be intercepted and interpreted by the 
operating system software. 


Integer Arithmetic Instructions 


We elect to describe integer arithmetic instructions, which are part of the class of oper- 
ate instructions, as the first systematic discussion of the Alpha instruction set. 


Addition, Subtraction, and Multiplication 


We have already somewhat intuitively made mention of the multiplication instruction 
and the addition and subtraction instructions in previous chapters: 


addt    Ra,Rb,Rc        ; Rc <-- Ra + Rb
subt    Ra,Rb,Rc        ; Rc <-- Ra - Rb
mult    Ra,Rb,Rc        ; Rc <-- Ra * Rb


where the fourth character of the opcode mnemonic denotes the type or size of the 
information unit: t = l for longword operands and t = q for quadword operands. In
longword operations, the high order 32 bits of source registers Ra and Rb are ignored; 
the sign bit is extended from bit 31 throughout the high order bits of destination register 
Rc in order to produce a result that can be subsequently used either in longword or 
quadword operations. Each operand is found by direct addressing in the standard forms 
for add, sub, and mul just given. That is, the assembler will produce an instruction 
word containing the 5-bit numerical address for each named register. 
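
The difference between the longword and quadword forms can be imitated in a few lines of Python. This is a sketch of the behavior described above, not of the hardware; registers are modeled as unsigned 64-bit integers, and the function names simply echo the mnemonics.

```python
MASK64 = (1 << 64) - 1


def sign_extend_32(value: int) -> int:
    """Copy bit 31 of the low longword into bits <63:32>."""
    low = value & 0xFFFFFFFF
    if low & 0x80000000:
        return low | 0xFFFFFFFF00000000
    return low


def addl(ra: int, rb: int) -> int:
    """Longword add: ignore the high 32 bits of both sources, sign-extend the sum."""
    return sign_extend_32((ra & 0xFFFFFFFF) + (rb & 0xFFFFFFFF))


def addq(ra: int, rb: int) -> int:
    """Quadword add: keep the low 64 bits of the sum."""
    return (ra + rb) & MASK64


print(hex(addl(0x7FFFFFFF, 1)))
print(hex(addq(0x7FFFFFFF, 1)))
```

The first call shows the sign bit being propagated through the upper half of the destination, while the second shows the same operands treated as full quadwords.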

These three instructions (add, sub, mul) also support literal addressing for the
second operand instead of register Rb:


addt    Ra,lit,Rc       ; Rc <-- Ra + lit
subt    Ra,lit,Rc       ; Rc <-- Ra - lit
mult    Ra,lit,Rc       ; Rc <-- Ra * lit


where lit is a byte-length unsigned constant in the range 0 to 255. That is, there are
actually two forms of these (and many other) integer operate instructions, as shown in 
Figure 4.2. If bit 12 within the function field of bits <15:5> is zero, then bits <15:13> 
should also be zero (reserved for future use by Digital) and the second operand is a reg- 
ister whose name (number) is in bits <20:16>. If bit 12 in the function field is one, then 
bits <20:13> specify the 8-bit unsigned constant. 


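Register form:  Opcode <31:26> | Ra <25:21> | Rb <20:16> | 000 <15:13> | 0 <12> | Function <11:5> | Rc <4:0>
Literal form:   Opcode <31:26> | Ra <25:21> | Literal <20:13> | 1 <12> | Function <11:5> | Rc <4:0>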


Figure 4.2 Register and literal forms of integer operate instructions 
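
To make the two layouts concrete, the Python sketch below packs the fields into a 32-bit word. The helper encode_operate is hypothetical (it is not part of any assembler), and it assumes that the function argument is the 7-bit code held in bits <11:5>.

```python
def encode_operate(opcode, ra, function, rc, rb=None, lit=None):
    """Pack an integer operate instruction; give exactly one of rb or lit."""
    if (rb is None) == (lit is None):
        raise ValueError("specify either rb or lit, but not both")
    word = ((opcode & 0x3F) << 26) | ((ra & 0x1F) << 21)
    word |= ((function & 0x7F) << 5) | (rc & 0x1F)
    if lit is not None:
        if not 0 <= lit <= 255:
            raise ValueError("literal must lie in the range 0 to 255")
        word |= (lit << 13) | (1 << 12)    # literal in <20:13>, bit 12 set
    else:
        word |= (rb & 0x1F) << 16          # register number in <20:16>
    return word


# Opcode 10 (hex) with function code 20 (hex) is quadword addition.
print(hex(encode_operate(0x10, 2, 0x20, 1, rb=1)))     # register form
print(hex(encode_operate(0x10, 2, 0x20, 1, lit=10)))   # literal form
```

Only bits <20:12> differ between the two results, which is why a single opcode and function code can serve both forms.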


The MACRO-64 assembler for OpenVMS directly supports this literal addressing
as just described. The Unix assembler treats mul as a pseudo-instruction and may substi- 
tute short sequences of the scaled arithmetic instructions when the second operand is a 
small integer rather than a register specification. We return to this point at the end of this 
chapter, where we show what the assembler may do. Much later, in Chapter 11, we revisit 
this point to suggest why such substitutions may produce efficiencies in a program. 

The decomposition and identification of a binary Alpha instruction is best left to 
commands of the debugger. Nevertheless, we now give one example of manual interpre- 
tation. Consider the instruction at line 2146 in Figure 3.3: 


   4    0    4    1    0    4    0    1            (hexadecimal)
0100 0000 0100 0001 0000 0100 0000 0001            (binary)
01 0000   0 0010   0 0001   000 0010 0000   0 0001 (bit fields)
     10       02       01             020       01 (hex fields)
 opcode       Ra       Rb   function code       Rc


The opcode of 10 (hex) identifies the instruction as some type of addition or subtraction 
in the integer operate group and establishes that the remaining bits should be interpreted 
according to the partitions of bit fields in Figure 4.2. The function code of 020 (hex) 
clarifies that the opcode is addq. The full instruction is finally seen to be addq
R2,R1,R1. A similar process could be used for any other instruction in Figure 3.3. 
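
The same manual interpretation can be written as a short Python sketch. The lookup table holds only a handful of the opcode and function code pairs from Table 4.1, and the name decode_operate is our own invention, so treat this as an illustration and not as a substitute for the debugger.

```python
# (opcode, 7-bit function code) -> mnemonic, a small subset of Table 4.1
INTEGER_OPERATE = {
    (0x10, 0x00): "addl", (0x10, 0x20): "addq",
    (0x10, 0x09): "subl", (0x10, 0x29): "subq",
    (0x13, 0x00): "mull", (0x13, 0x20): "mulq",
}


def decode_operate(word: int) -> str:
    """Split a 32-bit integer operate instruction into its bit fields."""
    opcode = (word >> 26) & 0x3F
    ra = (word >> 21) & 0x1F
    function = (word >> 5) & 0x7F
    rc = word & 0x1F
    if (word >> 12) & 1:                        # literal form
        second = str((word >> 13) & 0xFF)
    else:                                       # register form
        second = f"R{(word >> 16) & 0x1F}"
    name = INTEGER_OPERATE.get((opcode, function), "?")
    return f"{name} R{ra},{second},R{rc}"


print(decode_operate(0x40410401))
```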




The instructions for integer addition and subtraction share opcode 10, while the 
instructions for integer multiplication use opcode 13. The function codes in bits <15:5> 
of these instructions distinguish between longword or quadword operands and denote 
certain other variations of these instructions. All the possible arithmetic instructions are 
listed in Table 4.1, where we can see several patterns in the assigned function codes. For 
example, the value 20 (deriving from bit 10 in the instruction) is present for every quad- 
word instruction. We postpone consideration of the scaled addition and subtraction 
instructions until later in this chapter. 


Table 4.1 Alpha Arithmetic Instructions 


Mnemonic*          Opcode   Function Code   Purpose
addl               10       00              Longword addition
addq               10       20              Quadword addition
addlv, addl/v      10       40              Longword addition
addqv, addq/v      10       60              Quadword addition
s4addl             10       02              Scaled-by-4 longword addition
s4addq             10       22              Scaled-by-4 quadword addition
s8addl             10       12              Scaled-by-8 longword addition
s8addq             10       32              Scaled-by-8 quadword addition
subl               10       09              Longword subtraction
subq               10       29              Quadword subtraction
sublv, subl/v      10       49              Longword subtraction
subqv, subq/v      10       69              Quadword subtraction
s4subl             10       0B              Scaled-by-4 longword subtraction
s4subq             10       2B              Scaled-by-4 quadword subtraction
s8subl             10       1B              Scaled-by-8 longword subtraction
s8subq             10       3B              Scaled-by-8 quadword subtraction
mull               13       00              Longword multiplication
mulq               13       20              Quadword multiplication
mullv, mull/v      13       40              Longword multiplication
mulqv, mulq/v      13       60              Quadword multiplication
umulh              13       30              Unsigned quadword multiplication high





*Forms that signal overflow exceptions end in v (Unix) or /v (OpenVMS) 
Arithmetic Overflow


Arithmetic operations can result in overflow because all numeric representations in a 
computer are finite. Most simply, overflow arises when the result cannot fit within the 
size of the register or information unit for storage. With addition and subtraction, the 
apparent result then has the wrong sign. To see this, consider adding 2 plus 2 in a 3-bit 
two’s complement representation: 


   2       010
  +2      +010
   4       100     apparent binary result is -4 (wrong sign)


Overflow can easily occur with multiplication because the product of two N-bit num- 
bers may need as many as 2N bits for storage without distortion. 
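
The 3-bit example can be reproduced with a small Python sketch. The function add_twos_complement is hypothetical and models only the arithmetic, not how a processor reports the condition; it flags overflow when two operands of like sign yield a result of the opposite sign.

```python
def add_twos_complement(a: int, b: int, bits: int):
    """Add two signed values in a register of the given width.

    Returns (apparent_result, overflow_flag).
    """
    mask = (1 << bits) - 1
    raw = (a + b) & mask                    # keep only the bits that fit
    if raw & (1 << (bits - 1)):             # sign bit set: value is negative
        result = raw - (1 << bits)
    else:
        result = raw
    same_sign_inputs = (a >= 0) == (b >= 0)
    overflow = same_sign_inputs and ((result >= 0) != (a >= 0))
    return result, overflow


print(add_twos_complement(2, 2, 3))     # the example from the text
print(add_twos_complement(1, 2, 3))     # fits, so no overflow
```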

RISC systems require extra processor cycles to detect and report overflow syn- 
chronously. Therefore it is common to give the programmer (or compiler writer) a 
choice: either rapid calculations ignoring error conditions, or slower calculations with 
tracking of exceptions such as overflow. With the Alpha arithmetic instructions, special 
forms of the instruction mnemonics using the v suffix (Unix) or /v qualifier (Open- 
VMS) enable overflow to be detected and reported through error processing routines. 
High-level languages typically provide for such reporting. We will not use the special- 
ized v suffix or /v qualifier in our exposition in this book. 


Extended Precision Multiplication 


Multiplication can yield a product needing double-width storage, and certain algorithms 
require retention of all the bits of a product. Accordingly, many computer architectures 
facilitate such retention, though not necessarily using a single instruction. The umulh 
instruction in the Alpha has the two forms 


umulh   Ra,Rb,Rc
umulh   Ra,lit,Rc


This instruction stores in register Rc the high-order bits <127:64> of the 128-bit prod- 
uct resulting from multiplication of the unsigned integers contained in registers Ra and 
Rb, or in register Ra and the literal. If the two numbers being multiplied are actually 
signed integers, then the apparent result must be corrected by subtracting the value in 
Rb if the value in Ra is negative and by subtracting the value in Ra if the value in Rb is 
negative. The low-order bits <63:0> resulting from the same multiplication can be 
obtained separately using a mulq instruction with the same operands regardless of the
signs of the operands in registers Ra and Rb.
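
A Python sketch of this arrangement follows. It models registers as unsigned 64-bit integers and relies on Python's unbounded integers to form the full product; the helper signed_mulh is our own name for the correction procedure described above, not an Alpha instruction.

```python
MASK64 = (1 << 64) - 1


def umulh(ra: int, rb: int) -> int:
    """High-order bits <127:64> of the unsigned 128-bit product."""
    return ((ra & MASK64) * (rb & MASK64)) >> 64


def mulq(ra: int, rb: int) -> int:
    """Low-order bits <63:0> of the product, valid for either signedness."""
    return (ra * rb) & MASK64


def is_negative(reg: int) -> bool:
    """True when bit 63 is set, i.e. the register holds a negative quadword."""
    return bool((reg >> 63) & 1)


def signed_mulh(ra: int, rb: int) -> int:
    """Correct the umulh result when the operands are really signed."""
    high = umulh(ra, rb)
    if is_negative(ra):
        high -= rb & MASK64     # subtract Rb when Ra is negative
    if is_negative(rb):
        high -= ra & MASK64     # subtract Ra when Rb is negative
    return high & MASK64


a, b = -3 & MASK64, 5           # register images of -3 and +5
print(hex(signed_mulh(a, b)), hex(mulq(a, b)))
```

Concatenating the corrected high half with the mulq result gives the 128-bit two's complement product, here the representation of -15.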


What About Division? 


Like some previous architectures, the Alpha does not have a hardware opcode for integer
division. The Unix assembler offers several pseudo-instructions (divl, divq,
divlu, divqu, reml, remq, remlu, remqu) for signed and unsigned division and
remainder calculations using emulation through software subroutines. These destroy 
any previous information in registers R23—R25, R27, and R28 because scratch registers 
are needed within the subroutines. The OpenVMS system software library also contains 
routines for integer division. 

Other methods exist for rapid division in special cases. One involves shifting the 
binary representation of a number to the right in order to divide by powers of two. 
Another uses the instructions for floating-point division in conjunction with the instruc- 
tions for conversion between integers and floating-point numbers. We defer further 
treatment of integer division until later chapters. 


Introduction to Addressing Modes 


The topic of addressing modes—the means of specifying data locations—is much 
simpler for a RISC architecture than for CISC architectures. We have split our discus- 
sion of addressing modes into two parts. In this section we discuss the addressing 
modes that are pertinent to the Alpha architecture. Toward the end of the chapter we 
briefly describe some additional addressing modes that are used in other computer 
architectures. 

Most of the Alpha instructions of the integer operate class employ the same literal 
and direct addressing modes as the arithmetic instructions. Let us ensure that we attain 
an adequate understanding of these before proceeding. The basic concept involved with 
addressing modes is always to seek an answer to the question, “Where, and through 
what means, can the data be found?” 


Literal Addressing 


The instruction itself contains the operand data when literal addressing is used. This 
mode is sometimes also called immediate addressing. On the VAX and PDP-11, an
immediate operand is stored adjacent to the other parts of the instruction; therefore, an 
additional fetch cycle is usually required. An immediate operand is part of the code sec- 
tion of a program, which can be treated as a read-only memory segment by an operating 
system. Therefore literal addressing is almost exclusively used for numerical constants
whose values are known at the time of program assembly or compilation.

On the Alpha, no additional address calculation is required and no additional fetch 
cycle from memory is required to bring an immediate operand into the CPU. Only small 
positive values (0 to 255) can be accommodated as literals within the 32-bit width of 
any Alpha instruction. 


Direct Addressing 


The instruction itself contains the address of the data, not the actual data value, when 
direct addressing is used. For the Alpha and certain other contemporary architectures, 
this form of addressing is restricted to what used to be called register direct. Bits within 
the instruction specify “x” encoding the “address” (i.e., the number or name) of the
register Rx that contains the data. Years ago, when computers had very little memory, some
architectures could also specify full memory addresses within their instruction words.
That form of addressing was called memory direct.


Indirect Addressing 


When a register contains not the data, but an address pointer to the actual data, we refer 
to the process of accessing such data as indirect addressing. For many assemblers, the 
syntax for indirect addressing involves simply putting the name of the register in paren- 
theses, (Rx). A bit field within the instruction contains “x” encoding the name of the 
register. At the point during the execution of the instruction when the operand is 
involved, the contents of this register will be sent to the memory subsystem as the effec- 
tive address where the actual information unit is located. Again, years ago this address- 
ing mode would have been called register indirect in order to distinguish it from another 
variety of two-phase addressing where the address pointer would be found in a memory 
cell instead of a register, i.e., memory indirect addressing. Of these two possibilities, the 
Alpha supports only register indirect addressing. 


Displacement Addressing 


In a more general and more powerful addressing mode, the effective address is com- 
puted by combining the contents of a register and a constant called a displacement or 
offset. The instruction must allocate bits to designate the register and to specify the dis- 
placement. On the VAX and the PDP-11, the occurrence of a displacement is the main 
reason why instructions can have variable lengths. Displacement mode has also been 
called base addressing in some contexts, e.g., in certain IBM and Intel architectures 
where some registers are called base registers. In the Alpha, every integer register can
be used in conjunction with a displacement for this type of addressing.

The assembler syntax for displacement addressing is disp(Rx), where disp
can be an explicit signed value, symbolic value, or expression that can be evaluated by
the assembler (or sometimes by the linker). The Alpha architecture uses displacement
addressing only for load/store instructions, which we discuss in this chapter, and for
jump instructions, which we discuss later in Chapter 5. For the Alpha, the displacement
is a 16-bit signed number, ranging from -32768 to +32767 (see Figure 4.1). This value
is sign-extended to the full 64 bits by replicating bit 15 into bits <63:16> when disp is
being added to the value in register Rx. When the displacement is zero as a special case,
0(Rx), the effect is to implement indirect addressing on the Alpha.
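
The effective address calculation can be sketched in Python as follows. The function names are ours; disp_field stands for the raw 16 bits taken from the instruction word, and the register contents are treated as an unsigned 64-bit value.

```python
MASK64 = (1 << 64) - 1


def sign_extend_16(disp_field: int) -> int:
    """Interpret the low 16 bits as a signed value from -32768 to +32767."""
    disp_field &= 0xFFFF
    return disp_field - 0x10000 if disp_field & 0x8000 else disp_field


def effective_address(rx_contents: int, disp_field: int) -> int:
    """Contents of Rx plus the sign-extended displacement, modulo 2**64."""
    return (rx_contents + sign_extend_16(disp_field)) & MASK64


base = 0x30000
print(hex(effective_address(base, 8)))        # 8(Rx)
print(hex(effective_address(base, 0xFFF8)))   # -8(Rx)
print(hex(effective_address(base, 0)))        # 0(Rx), i.e. indirect
```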

The MACRO-64 assembler for OpenVMS handles an important form of implicit
displacement addressing using a previously designated base register. Our illustrative
programs for computing tables of squares have contained the directive


.base   r15,$ds


which instructs MACRO-64 that some earlier instruction has loaded the base
address $ds into the register and that R15 can be used as a base register. The assembler
maintains an active table of base register knowledge. If you no longer wish to
use a register for base addressing beyond a certain point in a program, you can
tell MACRO-64 to remove it from its table of potential base registers using the
directive without an address:


.base   r15


When a symbolic address is encountered in a memory-reference instruction, the assem- 
bler consults its table of potential base registers. This table search starts with the lowest- 
numbered register until one is found for which the calculated displacement will fit 
within the space available in the instruction word. 
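
The search just described can be sketched in Python. We do not know how MACRO-64 actually stores its table, so the dictionary and the function choose_base_register below are hypothetical; the sketch captures only the rule of trying the lowest-numbered register first and accepting the first displacement that fits in 16 signed bits.

```python
def choose_base_register(symbol_address: int, base_table: dict):
    """Return (register_number, displacement), or None if nothing fits.

    base_table maps a register number to the address that a .base
    directive has promised the register will contain.
    """
    for register in sorted(base_table):             # lowest number first
        displacement = symbol_address - base_table[register]
        if -32768 <= displacement <= 32767:         # fits the 16-bit field
            return register, displacement
    return None


# Made-up addresses: R15 covers a data section, R27 a linkage section.
table = {27: 0x20000, 15: 0x10000}
print(choose_base_register(0x10010, table))     # reachable from R15
print(choose_base_register(0x20008, table))     # too far from R15, so R27
print(choose_base_register(0x90000, table))     # no register reaches it
```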

This capability of MACRO-64 leads to improved legibility of programs, since we 
can refer symbolically to any memory location. Our programs have already used this 
very intuitive construct. Look back at our load or store references to sq1 through sq3
in SQUARES and SQUARES2. The assembler can convert our intentions into displacement
addressing using register R15.

In contrast, the Unix assembler expects the compiler or assembly language programmer
to set up base registers and track the usage of them. Otherwise, left to itself
with an instruction using what would be historically called direct addressing (like ldq
$x, quadword), the Unix assembler will generate sequences of instructions that function
correctly but perhaps inefficiently.


Summary of Alpha Addressing Modes 


Table 4.2 recapitulates the four addressing modes for the Alpha. As already mentioned, 
indirect addressing is implemented as a special case of displacement addressing with a 
zero displacement. Be sure that you comprehend the concept of effective address. 


Table 4.2 Effective Address for Alpha Addressing Modes 


Addressing Mode    Assembler Syntax    Effective Address
Literal            lit                 Instruction bits <20:13> if bit 12 has the
                                       value 1 in an integer operate instruction
Direct             Rx                  The named register
Indirect           (Rx) or 0(Rx)       Contents of the named register
Displacement       disp(Rx)            Contents of the named register added to
                                       disp, a signed 16-bit displacement


RADIX: Using Arithmetic Instructions 


We now illustrate the Alpha arithmetic instructions with a brief program (Figure 4.3) 
which, like our programs for computing a table of squares, consists of a simple 
sequence of instructions without branching. Suppose that we have stored the values of 
some decimal digits in quadwords starting at address D1. Further suppose that we want 
to combine these digits into a single numerical value in register R0.


/* RADIX        Number Conversions (Unix) */
/* This program will convert the positive number expressed by
   3 digits in radix RAD stored beginning at D1 into a value
   in register R0. */

RAD     =       10              # Define a symbolic constant
        .lit8                   # Store quad constants
D1:     .quad   9,1,4           # Interpret as "914"
        .text                   # Section for program code
        .align  4               # Octaword alignment
        .set    noreorder       # Disallow rearrangements
        .globl  main            # These three lines
        .ent    main            #  mark the mandatory
main:                           #  'main' program entry
        ldgp    $gp,0($27)      # Load the global pointer
        .frame  $sp,0,$26,0     # Describe the stack frame
        .prologue 1             # Say that $gp is in use
        .globl  first
first:  mov     +RAD,$2         # R2 = radix RAD
        lda     $3,D1           # R3 -> D1
        ldq     $1,0($3)        # Get first digit
        mulq    $1,$2,$0        # Multiply value by RAD
        ldq     $1,8($3)        # Get next digit
        addq    $1,$0,$0        # Add next digit
        mulq    $0,$2,$0        # Multiply value by RAD
        ldq     $1,16($3)       # Get next digit
        addq    $1,$0,$0        # Add next digit
                                # We could go further....
done:   mov     0,$0            # Signal all is normal
        ret     $31,($26)       # Back to Unix environment
        .end    main            # Mark end of procedure


        .title  RADIX   Number Conversions (OpenVMS)
; This program will convert the positive number expressed by
; 3 digits in radix RAD stored beginning at D1 into a value
; in register R0.

RAD     =       10              ; Define a symbolic constant
        $routine radix, kind=stack, data_section_pointer=true, -
                saved_regs=<r2,r3,r15>
        $data_section
D1::    .quad   9,1,4           ; Interpret as "914"
        $code_section
        .base   r27,$ls         ; R27 -> linkage section
        ldq     r15,$dp         ; R15 -> data section
        .base   r15,$ds         ; Tell MACRO to use this
first:: mov     +RAD,r2         ; R2 = radix RAD
        lda     r3,D1           ; R3 -> D1
        ldq     r1,0(r3)        ; Get first digit
        mulq    r1,r2,r0        ; Multiply value by RAD
        ldq     r1,8(r3)        ; Get next digit
        addq    r1,r0,r0        ; Add next digit
        mulq    r0,r2,r0        ; Multiply value by RAD
        ldq     r1,16(r3)       ; Get next digit
        addq    r1,r0,r0        ; Add next digit
                                ; We could go further....
done::  mov     1,r0            ; Signal all is normal
        $return                 ; Return to OpenVMS
        $end_routine radix      ; Needed by $routine
        .end    radix           ; Set start address


Figure 4.3 RADIX: An illustration of arithmetic instructions 

The algorithm to accomplish this task just applies the weighting scheme that we 
outlined in Chapter 1, using successive powers of the radix. That is, the most significant 
digit is multiplied by the radix, the next most significant digit is added, and this process 
is continued (multiplying the intermediate result by the radix and then adding the value 
of the next digit) until all the digits have been processed. 
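
For readers who like to check an algorithm in a high-level language first, the weighting scheme can be sketched in Python as follows. This is only an illustration under our own naming (the function digits_to_value is hypothetical and is not part of RADIX); it starts from an accumulator of zero rather than from the first digit, which gives the same result.

```python
def digits_to_value(digits, radix):
    """Combine digits (most significant first) into one integer."""
    value = 0
    for digit in digits:
        if not 0 <= digit < radix:
            raise ValueError(f"digit {digit} is not valid in radix {radix}")
        # Multiply the intermediate result by the radix, then add the next digit.
        value = value * radix + digit
    return value


print(digits_to_value([9, 1, 4], 10))        # 914
print(hex(digits_to_value([9, 1, 4], 10)))   # 0x392
```

The hexadecimal form is what a debugger would show when register contents are examined in hex.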

The program defines the assembler symbolic constant RAD as a particular value 
(here 10). The Unix version uses the .lit8 directive, which permits the assembler to 
use efficient addressing for constant quadword data. Both versions contain instructions 
at label first to put the radix RAD into register R2 and to put the address D1 into 
register R3. The unary plus sign in front of RAD helps the assemblers recognize that an 
expression is to be reduced to a known constant which can be represented using a literal. 

If you have access to a Unix system, you can stop the program at label done and 
inspect the contents of register R0 using the following commands to dbx: 


/done 
stop at 31 
run 
px $r0 


Similarly, if you have access to an OpenVMS system, you can stop the program at label 
done and inspect the contents of register R0 using the following commands to the 
symbolic debugger: 


set break done 
go 
examine/hex r0 


Alternatively, with the OpenVMS symbolic debugger you could use these commands to 
set up an appropriately partitioned screen display: 


dbg> set mode screen 
dbg> display reg at rh1 do (examine/hex R0:R3,R15,PC) 


Then repeated step commands would move through the program and show its behav- 
ior in detail during each instruction. 

The final answer is 392₁₆ = 914₁₀. This simple example again shows the essence 
of a load/store architecture. Each item of stored information must be loaded from 
memory into a register before it can be operated upon. The operate instructions cannot 
directly access data. 


Integer Load and Store Instructions 


We will now discuss some of the Alpha load and store instructions, which fall into four 
groups (Table 4.3): load register with address (ldax), load register with data (ldt), 
store data from register (stt), and special “locked” and “conditional” versions needed 
by synchronization routines within an operating system in multiprocessor 
configurations. The character t = l (longword) or t = q (quadword) specifies the 
information unit size, and a second kind of load address instruction has x = h. All of 
these instructions use direct addressing for one register operand (Ra) and either indirect 
or displacement addressing for the other operand using register Rb (see Figure 4.1). 


Table 4.3 Alpha Integer Load and Store Instructions 


Mnemonic  Opcode  Purpose 
lda       08      Load Ra with effective address, i.e., Rb + disp 
ldah      09      Load Ra with effective address high, i.e., Rb + disp*65536 
ldl       28      Load Ra with longword contents found at effective address 
ldq       29      Load Ra with quadword contents found at effective address 
ldq_u     0B      Unaligned quadword load without generating an exception 
stl       2C      Store longword contents in Ra at effective address 
stq       2D      Store quadword contents in Ra at effective address 
stq_u     0F      Unaligned quadword store without generating an exception 
ldl_l     2A      Load longword locked 
ldq_l     2B      Load quadword locked 
stl_c     2E      Store longword conditional 
stq_c     2F      Store quadword conditional 


In these Alpha memory-reference instructions, the displacement is allocated only 
16 bits. Since this displacement is interpreted as a signed byte address offset, only a 
range of data addresses from -32768 to +32767 with respect to register Rb is accessible. 
If some data needed at the same time are more widely separated than 64K addressing 
units, more than one base register Rb can be used. Often, several registers may be in use 
rather naturally for that purpose within a routine because several logically distinct data 
structures are being accessed. 
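
The reach of a single base register can be expressed as a one-line test. The following Python sketch is only an illustration; the function name reachable and the base address 0x20000 are our own assumptions, not anything defined by the Alpha assemblers.

```python
DISP_MIN, DISP_MAX = -32768, 32767   # range of a signed 16-bit displacement


def reachable(base, target):
    """True if target can be addressed as disp(Rb) when Rb holds base."""
    return DISP_MIN <= target - base <= DISP_MAX


base = 0x20000                           # hypothetical value in Rb
print(reachable(base, base + 32767))     # True
print(reachable(base, base + 32768))     # False: a second base register is needed
print(reachable(base, base - 32768))     # True
```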


Load Address Instructions 


Even from the earliest stages of learning computer science, a student comes to 
appreciate the extraordinary power of pointers, i.e., address values that can be used to 
find data storage locations. Examples of pointers in everyday life include house 
addresses, telephone numbers, and e-mail names. 

At the assembly language level, an effective address is a pointer value. The Alpha 
load address instructions can be used to set up a register as a pointer. The essence of the 
technique is most easily shown using the conventions of the OpenVMS assembler, as in 
this schematic program fragment: 


        $data_section 
        .blkq   <some previous data> 
TheData: .blkq  <lots of quadwords> 
        .blkq   <some subsequent data> 
        $code_section 
        .base   R27,$ls 
        ldq     R15,$dp 
        .base   R15,$ds 
        lda     R14,TheData     ; R14 -> quadwords 
        ldq     R0,(R14)        ; R0 = 1st quadword 
        ldq     R1,8(R14)       ; R1 = 2nd quadword 
                                ; etc. 


Here the lda instruction will load register R14 with the address equivalent to the 
location designated by the symbol TheData which, you will recall, is subject to relocation 
by the linker. This register can be used in displacement addressing (and its special case, 
indirect addressing) for the purpose of bringing data from selected information units 
into other registers for processing. 

The load address (lda) instruction puts into Ra the value resulting from adding 
the sign-extended displacement to the current value in Rb. Remember that a two’s 
complement integer can be extended from a compacted form to any width by replicating 
the sign bit to the left arbitrarily far. 

The load address high (ldah) instruction works similarly, except that the 
displacement is first multiplied by 65536 before being sign-extended. This instruction is 
useful for constructing pointers in larger addressing contexts, for example: 


MEM0 = 0 
MEM1 = 1 
MEM2 = 2 
                                ; etc. 
        $data_section 
BigData: .blkq  <huge amount of storage allocated> 
        $code_section 
        .base   R27,$ls 
        ldq     R15,$dp 
        .base   R15,$ds 
        lda     R0,BigData      ; R0 -> BigData 
        ldah    R10,MEM0(R0)    ; R10 -> BigData + 0*64K 
        ldah    R11,MEM1(R0)    ; R11 -> BigData + 1*64K 
        ldah    R12,MEM2(R0)    ; R12 -> BigData + 2*64K 
                                ; etc. 
        .base   R10,BigData 
        .base   R11,BigData+1*65536 
        .base   R12,BigData+2*65536 
                                ; etc. 


This sequence would set up address pointers for memory in 64K chunks. Since the dis- 
placement field in any load or store instruction is only 16 bits wide, and thus can span 
only a range of 64K, more than one base register can be required whenever a program’s 
dynamic data storage requirements exceed 64K bytes. 
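
The arithmetic performed by the two load address instructions can be modeled in a few lines of Python. This is a sketch under stated assumptions: the functions lda and ldah below are our own models of the instructions, they take the displacement as an ordinary signed integer, and the address given to big_data is hypothetical.

```python
MASK64 = (1 << 64) - 1   # registers are 64 bits wide


def lda(rb, disp):
    """Model of lda: Ra = Rb + disp, for a signed 16-bit disp."""
    if not -32768 <= disp <= 32767:
        raise ValueError("displacement does not fit in 16 bits")
    return (rb + disp) & MASK64


def ldah(rb, disp):
    """Model of ldah: Ra = Rb + disp*65536, for a signed 16-bit disp."""
    if not -32768 <= disp <= 32767:
        raise ValueError("displacement does not fit in 16 bits")
    return (rb + disp * 65536) & MASK64


big_data = 0x20000                                  # hypothetical address of BigData
pointers = [ldah(big_data, n) for n in range(3)]    # like R10, R11, R12
print([hex(p) for p in pointers])   # ['0x20000', '0x30000', '0x40000']
```

Each pointer is the base of one 64K chunk, so consecutive chunks can be reached with ordinary 16-bit displacements.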


Non-addressing Uses of Load Address Instructions 


Many mathematical applications involve the use of specific numerical constants. If a 
quantity is truly constant in an algorithm, it can be “hard-coded” into instructions with- 
out problems. (If a quantity is situation-dependent, it should instead be allocated to a 
memory unit and be accessed symbolically.) 

One special number is always right at hand in an Alpha, because integer register 
R31 contains a read-only copy of the number zero. The hardware ignores any attempt to 
modify R31 using any Alpha instruction. The lda and ldah instructions have 
non-addressing uses in putting numerical contents into registers. The hardware does not 
know whether a binary number represents an address or not. Register R31 in 
displacement mode has the following effects with lda and ldah instructions: 


lda Ra,NUMBER(R31) ; Ra = 0 + Sign-extended NUMBER 
ldah Ra,NUMBER(R31) ; Ra = 0 + Sign-extended NUMBER * 65536 


where NUMBER can be an assembler expression. The lda form is one of the methods 
used by the assembler to implement the pseudo-instruction mov literal,Rx that we 
have intuitively used in our programs. (Another method that the assembler may use for 
small positive literals will be pointed out to you in a later chapter.) 

Here we have an interesting contrast and an important lesson about the 
hardware-software borderline. In some architectures, a register can be initialized to a 
particular numerical value using literal addressing with a genuine mov opcode. In the 
Alpha architecture, the same effect can be achieved using displacement addressing with 
respect to the known special value of zero for register R31 with the lda opcode. We 
give two examples: 


mov 32767,R3     becomes  207F7FFF  or  lda R3,32767(R31) 
mov -32767,R3    becomes  207F8001  or  lda R3,-32767(R31) 
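
These two machine-code values can be checked by assembling the fields by hand. The Python sketch below is illustrative only (the function name encode_memory_format is our own); it assumes the memory instruction format with the opcode in bits <31:26>, Ra in bits <25:21>, Rb in bits <20:16>, and the displacement in bits <15:0>, together with the lda opcode 08 from Table 4.3.

```python
def encode_memory_format(opcode, ra, rb, disp):
    """Assemble a 32-bit memory-format instruction word from its fields."""
    if not -32768 <= disp <= 32767:
        raise ValueError("displacement does not fit in 16 bits")
    # disp & 0xFFFF keeps the 16-bit two's complement form of the displacement.
    return (opcode << 26) | (ra << 21) | (rb << 16) | (disp & 0xFFFF)


LDA = 0x08   # opcode from Table 4.3

print(f"{encode_memory_format(LDA, 3, 31, 32767):08X}")    # 207F7FFF
print(f"{encode_memory_format(LDA, 3, 31, -32767):08X}")   # 207F8001
```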


The designers of the Alpha processor architecture surely recognized the need for 
moving a constant into a register. The designers of the assemblers included the built-in 
symbol table entry for mov and the syntactical rules patterned after the mov instructions 
in the PDP-11 and VAX architectures for the special case of moving a literal into a 
register. A relatively simple software extension thus compensates for what would be a 
redundant capability if the hardware had an actual mov opcode, since lda already can 
accomplish what is required. 


The reference manual for the OpenVMS assembler lists over twenty “decodable 
pseudo-operations” besides mov. In a somewhat bolder approach, the programmer’s 
guide for the Unix assembler presents a blended set of assembly-language instructions. 
The set includes true machine-code instructions, pseudo-operations like mov, and 
numerous instructions that generate more than one machine-code instruction. The latter 
circumstances include not only integer division, but special situations where a poten- 
tially simple machine-code instruction is actually expanded into more than one instruc- 
tion. Thus with the Alpha, computer architecture has come to a juncture where the 
hardware-software border blurs. Moreover, the assembly language programmer in an 
OpenVMS environment “sees” a somewhat different machine than someone else who 
programs in a Digital UNIX environment even if the “box” they acquired is the same 
(apart from system-specific PALcode). 


Load and Store Operations 


In the Alpha architecture the integer load instructions (ldl and ldq) and store 
instructions (stl and stq) all work in the same basic way with respect to calculation of 
the effective address. The signed 16-bit displacement found in bits <15:0> of the 
instruction word is sign-extended to full 64-bit width and added using two’s complement 
arithmetic to the value in the register Rb whose name (number) is found in bits <20:16> 
of the instruction word. The numeric result is the effective virtual address submitted to 
the memory storage subsystem of the computer. Of course, the effective address is an 
unsigned quantity. If the value in register Rb were zero and the displacement were -8, 
the effective address would actually correspond to the highest possible quadword 
address for the Alpha (i.e., the address space would appear to “wrap around”). 
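
The wrap-around is easy to see if the calculation is modeled with explicit 64-bit masking. In this Python sketch (our own illustration; the function name effective_address is hypothetical), the displacement is passed as the raw 16-bit field taken from bits <15:0> of the instruction word, so -8 appears as FFF8.

```python
MASK64 = (1 << 64) - 1


def effective_address(rb_value, disp_field):
    """Sign-extend the raw 16-bit displacement field and add it to Rb."""
    disp = disp_field - 0x10000 if disp_field & 0x8000 else disp_field
    return (rb_value + disp) & MASK64    # the result is an unsigned 64-bit address


print(hex(effective_address(0, 0xFFF8)))        # 0xfffffffffffffff8
print(hex(effective_address(0x1000, 0x0010)))   # 0x1010
```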

For the load instructions, the contents of the quadword- or longword-length 
information unit at the effective source address are brought into the central processor. 
With ldl, the longword-length datum is itself sign-extended to full 64-bit width before 
being put into the destination register Ra. That way, the value will automatically work 
with either longword or quadword integer operate instructions without any further 
conversion. With all the load instructions, the destination for the data is the register Ra 
whose name (number) is found in bits <25:21> of the instruction word. The assembler 
syntax is: 


ldt Ra,disp(Rb) 
ldt Ra,source 


where the information unit type t = l for longword data or t = q for quadword data, 
and where disp is a signed displacement to be added to the current address value in 
register Rb or where source is a symbolic address that is within 32K addressing units 
(in either direction) from a base register value known to the assembler. 

For the store instructions, the source for the data is the register Ra whose name 
(number) is found in bits <25:21> of the instruction word. The entire contents of this 
register (stq) or the longword-length contents from bits <31:0> of this register (stl) 
are dispatched from the central processor for storage at the destination specified by the 
effective address. The assembler syntax is: 


stt Ra,disp(Rb) 
stt Ra,destination 


where the information unit type t = l for longword data or t = q for quadword data, 
and where disp is a signed displacement to be added to the current address value in 
register Rb or where destination is a symbolic address that is within 32K addressing 
units in either direction from a base register value known to the assembler. 
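
The difference between longword and quadword transfers can be modeled with a small byte-addressed memory. The following Python sketch is only an illustration: the functions stl, ldl, and ldq are our own models, memory is a plain bytearray, and we assume the little-endian byte order used by Alpha systems.

```python
MASK64 = (1 << 64) - 1


def stl(memory, addr, reg):
    """Store bits <31:0> of the register as a longword."""
    memory[addr:addr + 4] = (reg & 0xFFFFFFFF).to_bytes(4, "little")


def ldl(memory, addr):
    """Load a longword and sign-extend it to 64 bits."""
    raw = int.from_bytes(memory[addr:addr + 4], "little")
    if raw & 0x80000000:
        raw -= 1 << 32
    return raw & MASK64


def ldq(memory, addr):
    """Load a full quadword with no conversion."""
    return int.from_bytes(memory[addr:addr + 8], "little")


memory = bytearray(16)
stl(memory, 0, -5 & MASK64)     # store the low longword of -5
print(hex(ldl(memory, 0)))      # 0xfffffffffffffffb (still -5 as a quadword)
print(hex(ldq(memory, 0)))      # 0xfffffffb (upper longword in memory is zero)
```

The sign extension performed by ldl is what lets the reloaded value behave as -5 in quadword arithmetic.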

All of these ordinary load and store instructions require that the information unit 
in memory be naturally aligned, that is, the effective address for a quadword is exactly 
divisible by 8 (bits <2:0> in the address must be zero) or for a longword is exactly divis- 
ible by 4 (bits <1:0> in the address must be zero). If the specified information unit is not 
naturally aligned, a time-consuming hardware exception will occur that the system soft- 
ware will have to process. 

The Alpha architecture also provides load quadword unaligned (ldq_u) and store 
quadword unaligned (stq_u) instructions that can access data at any effective address 
without incurring a hardware exception. With these instructions, bits <2:0> of the 
effective address are forced to zero before the effective address is conveyed to the memory 
subsystem. It is then up to the assembly language programmer (or compiler writer) to 
deal with the consequences of having wanted to access data at an unaligned address, 
because the hardware has actually worked with an aligned address whose fetched or 
stored contents partially overlap with the contents that were actually desired. These 
unaligned load and store instructions will be discussed further in Chapter 6. 
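
Both the natural alignment test and the address adjustment made by the unaligned instructions reduce to simple bit operations. This Python sketch is an illustration under our own naming (is_naturally_aligned and ldq_u_address are hypothetical helpers), using an arbitrary sample address.

```python
def is_naturally_aligned(address, size):
    """True if address is exactly divisible by the information unit size."""
    return address % size == 0


def ldq_u_address(effective_address):
    """Address actually used by ldq_u or stq_u: bits <2:0> forced to zero."""
    return effective_address & ~0x7


ea = 0x20003                          # arbitrary unaligned address
print(is_naturally_aligned(ea, 8))    # False
print(hex(ldq_u_address(ea)))         # 0x20000
print(ea & 0x7)                       # 3: byte offset of the datum in that quadword
```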


Addressing Details in the RADIX Program 


We can now benefit from looking more closely at some of the addressing details in the 
RADIX program. If you have access to a Unix or OpenVMS programming environ- 
ment, we strongly encourage you to follow through one of these subsections as we 
show how to investigate the corresponding version of RADIX (Figure 4.3). 


Unix Addressing 


We can use the Unix nm command with the -x flag to obtain a symbol table for 
RADIX. Only these few entries, which we have rearranged into descending order by 
address, will concern us here: 


_gp     0x00000140008090 
.lit8   0x000001400000a0 
D1      0x00000140000080 
.data   0x00000140000000 
first   0x00000120001158 
main    0x00000120001150 
.text   0x00000120001060 


Although addresses on the Alpha are 64 bits wide, the nm command displays only 14 
hex characters to the right of 0x. The Unix system loader initiates a program with both 
the program counter and register R27 preset to contain the starting address for main. 
While it executes, the routine then has “self-awareness” of its own address space 
because the value in register R27 serves as a master pointer. 

What does the ldgp pseudo-instruction at main do? Using dbx we can issue the 
command main/2i and see that two machine instructions were generated: 


(dbx) main/2i 
[main:17, 0x120001150] ldah gp, 8192(r27) 
[main:17, 0x120001154] lda gp, 28480(gp) 


where the displacements are decimal and where, curiously, dbx displays the names of 
registers as gp and r27, not quite the same as the $gp and $27 in our source program 
or the $gp and $r27 which dbx itself would require as input, say on a px command. 
Again using dbx, we can issue the command main/2xx and look at these same two 
instructions expressed as longwords in hexadecimal: 


(dbx) main/2xx 
0000000120001150: 27bb2000 23bd6f40 


Now we can verify the result of their execution, i.e., determine what values go into 
register gp (register R29), which is called the global pointer: 


gp <- 0x00000140001150 = 0x00000120001150 + 0x00000020000000 
gp <- 0x00000140008090 = 0x00000140001150 + 0x00000000006f40 


The final result matches the value of the symbol _gp as obtained using the nm com- 
mand. 
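
The same verification can be carried out mechanically from the two instruction words 27bb2000 and 23bd6f40. The Python sketch below is our own illustration, not a tool supplied with the system; decode_memory_format is a hypothetical helper that assumes the memory instruction format (opcode in bits <31:26>, Ra in <25:21>, Rb in <20:16>, displacement in <15:0>), and R27 is preset to the address of main.

```python
MASK64 = (1 << 64) - 1
LDA, LDAH = 0x08, 0x09                   # opcodes from Table 4.3


def decode_memory_format(word):
    """Split a memory-format instruction word into (opcode, ra, rb, disp)."""
    opcode = word >> 26
    ra = (word >> 21) & 0x1F
    rb = (word >> 16) & 0x1F
    field = word & 0xFFFF
    disp = field - 0x10000 if field & 0x8000 else field
    return opcode, ra, rb, disp


regs = {27: 0x120001150}                 # R27 -> main
for word in (0x27BB2000, 0x23BD6F40):
    opcode, ra, rb, disp = decode_memory_format(word)
    if opcode not in (LDA, LDAH):
        raise ValueError("only lda and ldah are modeled here")
    scale = 65536 if opcode == LDAH else 1
    regs[ra] = (regs[rb] + disp * scale) & MASK64
    print(f"R{ra} <- {regs[ra]:#x}")     # 0x140001150, then 0x140008090
```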

Just below first, we used an lda instruction, but the Unix assembler has modified 
this to some kind of ldq instruction, as we can see using dbx with the command 
first/2i: 


(dbx) first/2i 
[main:21, 0x120001158] addq r31, 0xa, r2 
[main:22, 0x12000115c] ldq r3, -32736(gp) 


and the command first/2xx: 


(dbx) first/2xx 
0000000120001158: 43e15402 a47d8020 


Note in passing that the assembler has implemented the mov pseudo-instruction at 
first using a variant of the addq instruction. 

Now, what is the effective address -32736(gp) in the ldq instruction that the 
assembler has substituted for our lda instruction? 


effective address = 0x00000140008090 + 0xffffffffff8020 
                  = 0x000001400000b0 


What value stored at that address is loaded into register R3? Using dbx: 


(dbx) 0x00000001400000b0/X 
00000001400000b0: 0000000140000080 


we find a value that matches the value of the symbol D1 as obtained previously using 
the nm command. That is, the assembler put an address pointer to D1 (i.e., the address 
of D1) into a quadword in the .lit8 section, where it can be located at run time using 
the global pointer. 

Finally, if you are still skeptical about all these calculated 64-bit addresses, we can 
look for the data specified in the source program: 


(dbx) 0x0000000140000080/3D 
0000000140000080: 9 1 
0000000140000090: 4 


We suggest that you consolidate your understanding by drawing a diagram of memory and 
relevant registers annotated with all the addresses and contents given in this subsection. 


OpenVMS Addressing 


We begin by excerpting from the link map certain values relating to the organization of 
our program into psects (program sections) and the symbols we used and rearranging 
those into descending order by address (hexadecimal) as follows: 


Psect Name      Module Name     Base 
$CODE$                          00030000 
$DATA$                          00020000 
$LINK$                          00010000 

Symbol          Value 
DONE            00030048-R 
FIRST           00030024-R 
D1              00020000-R 
RADIX           00010008-R 


Although addresses on the Alpha are 64 bits wide, the link map displays only 8 hex 
characters for these particular relocatable addresses. The OpenVMS system loader 
initiates a program with the program counter preset to contain the starting address (30000 
in this case, because the $routine macro inserts several instructions ahead of the 
programmer’s code) and with register R27 pointing to the label RADIX established by the 
$routine macro. 

What does the ldq instruction just above first do? Using the debugger, we can 
issue an examine command to show the three instructions just before, exactly at, and 
just after first: 


DBG> examine/instruction first-4:first+4 
RADIX\%LINE 2141: LDQ R15,#XFFF8(R27) 
RADIX\%LINE 2143: BIS R31,#X0A,R2 
RADIX\%LINE 2144: LDA R3,(R15) 


The OpenVMS assembler accepts, but does not require, a number sign in front of 
expressions for displacements and literals in the operand field. Now, what is the 
effective address #XFFF8(R27) in the ldq instruction? It will be the sum of the value in 
register R27 (i.e., the value of the symbol RADIX) and the sign-extended displacement: 


effective address = 0000000000010008 + FFFFFFFFFFFFFFF8 
= 0000000000010000 


What value stored at that address is loaded into register R15? Using the debugger: 


DBG> examine/quadword 10000 
0000000000010000: 0000000000020000 


we find a value that matches the start of the $DATA$ psect (program section). 

Note in passing that the assembler has implemented the mov pseudo-instruction at 
first using a variant of the bis instruction (logical OR, to be discussed in Chapter 5). 

How does the lda instruction just below first work? In this case, the displacement 
from the base register R15 is zero. That is because the assembler started putting 
our data values marked by the label D1 at the beginning (address 20000) of the 
$DATA$ psect. More generally, there would be a displacement to be used in 
conjunction with base register R15. 

Finally, if you are still skeptical about all these calculated 64-bit addresses, we can 
look for the data specified in the source program: 


DBG> examine/quadword 20000:20010 
RADIX\D1:   0000000000000009 
D1+8:       0000000000000001 
D1+10:      0000000000000004 


We suggest that you consolidate your understanding by drawing a diagram of mem- 
ory and relevant registers annotated with all the addresses and contents given in this 
subsection. 


Accessing Simple Record Structures Using 
Displacements 


In many applications, the data will comprise numerous records, each having identical 
internal structure in the form of fields of information. An architecture that supports 
displacement addressing can access such data conveniently. Consider the fictitious 
situation of a company that makes 40 kinds of widgets. The general manager of this 
company tracks the cost of manufacture, the selling price, the shipping cost, and the 
number of units in current inventory for each kind of widget. For simplicity, each such 
piece of information about the widgets could be stored as a quadword. 

The same fields of information would occur for each of 40 kinds of widgets, and it 
makes sense to store those fields always in the same relative order for each kind. This 
can be effectively done by defining some displacements, or address offsets: 


MODEL = 0 
COST = 8 
PRICE = 16 
FREIGHT = 24 
STOCK = 32 


In a program, one register can be stepped along by the total record size (here, 40 bytes), 
and the symbolic displacements will then specify each desired field. For example, the 
total cost of manufacture of all the widgets in the inventory can be computed schemati- 
cally as follows: 


        lda     R3,TABLE        ; R3 -> all the data 
        mov     0,R4            ; R4 = value accumulator 
loop:   ldq     R5,COST(R3)     ; R5 = unit cost 
        ldq     R6,STOCK(R3)    ; R6 = units in inventory 
        mulq    R5,R6,R6        ; R6 = value for that kind 
        addq    R6,R4,R4        ; R4 = running value 
        goto done if at end of table 
        lda     R3,40(R3)       ; Move to next kind 
        BR      loop 
done: 


where we defer consideration of the details about loop control until Chapter 5. The 
lda R3,40(R3) instruction makes clear the role of R3 as an address pointer in a way 
that addq R3,40,R3 would not. 

Displacement addressing in the Alpha load and store instructions can thus handle 
record structures in a very natural and efficient way because only one base register is 
required for accessing every field of every record. 
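
The same stepping of one base pointer through fixed-size records can be tried out in Python. This sketch is an illustration only: the two records are invented data, the helper names build_table and inventory_cost are hypothetical, and the records are packed as little-endian quadwords with the standard struct module.

```python
import struct

# Field displacements within one record, as defined in the text
MODEL, COST, PRICE, FREIGHT, STOCK = 0, 8, 16, 24, 32
RECORD_SIZE = 40                     # five quadwords per record


def build_table(records):
    """Pack (model, cost, price, freight, stock) tuples as quadwords."""
    return b"".join(struct.pack("<5q", *record) for record in records)


def inventory_cost(table):
    """Sum cost * stock over all records, stepping one base pointer."""
    total = 0
    for base in range(0, len(table), RECORD_SIZE):   # like lda R3,40(R3)
        (cost,) = struct.unpack_from("<q", table, base + COST)
        (stock,) = struct.unpack_from("<q", table, base + STOCK)
        total += cost * stock
    return total


# Invented data for two kinds of widgets
table = build_table([(1, 5, 9, 1, 10), (2, 7, 12, 2, 3)])
print(inventory_cost(table))         # 71
```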


DOT_3: Using Load and Store Instructions 


We will now illustrate a technique that is useful when the load and store instructions 
need to refer to corresponding constituents in distinct data structures of the same 
format. Three-component vectors are common in physics and engineering problems. In 
vector algebra, the scalar product (also called the inner product, or the dot product) is 
formed as a sum of products of corresponding components: 


$$V \cdot W = v_x \times w_x + v_y \times w_y + v_z \times w_z$$


It makes sense to store the X, Y, and Z components of each vector in adjacent informa- 
tion units. We will select quadword storage for two vectors V1 and V2 in our sample 
program (Figure 4.4). Defining three symbolic constants for X, Y, Z will then enable us 
to map these little data structures readily. For example, the address of the Y component 
of V1 could be symbolically represented as V1+Y. The assembled load and store 
instructions will contain appropriate displacements (X, Y, or Z) from a base register 
(R14 in our case for V1). 





/* DOT_3  Scalar product of 3-vectors (Unix) */ 
/* This program computes the scalar ("inner", "dot") 
   product of two 3-component vectors V1, V2. */ 
X = 0                           # Offset for x component 
Y = 8                           # Offset for y component 
Z = 16                          # Offset for z component 
        .data 
V1:     .quad   -1,+3,+5        # 3-vector named V1 
V2:     .quad   -2,-4,+6        # 3-vector named V2 
        .text                   # Section for program code 
        .align  4               # Octaword alignment 
        .set    noreorder       # Disallow rearrangements 
        .globl  main            # These three lines 
        .ent    main            # mark the mandatory 
main:                           # 'main' program entry 
        ldgp    $gp,0($27)      # Load the global pointer 
        .frame  $sp,0,$26,0     # Describe the stack frame 
        .prologue 1             # Say that $gp is in use 
        .globl  first 
first:  lda     $14,V1          # R14 -> vector V1 
        lda     $15,V2          # R15 -> vector V2 
        addq    $31,$31,$0      # R0 = running sum 
        ldq     $1,X($14)       # R1 = x component of V1 
        ldq     $2,X($15)       # R2 = x component of V2 
        mulq    $1,$2,$1        # R1 = x*x now 
        addq    $1,$0,$0        # Update the sum 
        ldq     $1,Y($14)       # R1 = y component of V1 
        ldq     $2,Y($15)       # R2 = y component of V2 
        mulq    $1,$2,$1        # R1 = y*y now 
        addq    $1,$0,$0        # Update the sum 
        ldq     $1,Z($14)       # R1 = z component of V1 
        ldq     $2,Z($15)       # R2 = z component of V2 
        mulq    $1,$2,$1        # R1 = z*z now 
        addq    $1,$0,$0        # Update the sum 
done:   mov     0,$0            # Signal all is normal 
        ret     $31,($26),1     # Back to Unix environment 
        .end    main            # Mark end of procedure 


        .title  DOT_3  Scalar product of 3-vectors (OpenVMS) 
; This program computes the scalar ("inner", "dot") 
; product of two 3-component vectors V1, V2. 
X = 0                                   ; Offset for x component 
Y = 8                                   ; Offset for y component 
Z = 16                                  ; Offset for z component 
        $routine dot_3, data_section_pointer=true, - 
                 kind=stack, saved_regs=<r2,r14,r15> 
        $data_section 
V1::    .quad   -1,+3,+5 
V2::    .quad   -2,-4,+6 
        $code_section 
start:: .base   r27,$ls                 ; R27 -> linkage section 
        ldq     r15,$dp                 ; R15 -> data section 
        .base   r15,$ds                 ; Tell MACRO about this 
first:: lda     r14,V1                  ; R14 -> vector V1 
        lda     r15,V2                  ; R15 -> vector V2 
        addq    r31,r31,r0              ; R0 = running sum 
        ldq     r1,X(r14)               ; R1 = x component of V1 
        ldq     r2,X(r15)               ; R2 = x component of V2 
        mulq    r1,r2,r1                ; R1 = x*x now 
        addq    r1,r0,r0                ; Update the sum 
        ldq     r1,Y(r14)               ; R1 = y component of V1 
        ldq     r2,Y(r15)               ; R2 = y component of V2 
        mulq    r1,r2,r1                ; R1 = y*y now 
        addq    r1,r0,r0                ; Update the sum 
        ldq     r1,Z(r14)               ; R1 = z component of V1 
        ldq     r2,Z(r15)               ; R2 = z component of V2 
        mulq    r1,r2,r1                ; R1 = z*z now 
        addq    r1,r0,r0                ; Update the sum 
done::  mov     1,r0                    ; Tell OpenVMS 
        $return                         ; we're ending normally 
        $end_routine dot_3              ; Needed by $routine 
        .end    dot_3                   ; Set start address 
Figure 4.4 DOT_3: An illustration of memory addressing 

One particular instruction in the DOT_3 program deserves special mention, 
namely, the one that initializes a register to the number zero. In the Alpha, integer 
register R31 is always zero. Thus we can use this built-in known value with any suitable 
integer operate instruction whenever we want to. Since zero plus zero yields zero, 
we have chosen an addq instruction to set up register R0 as the accumulator variable to 
hold the sum of products. 

With this background, you should have little difficulty in following the flow of the 
entire calculation. What numerical result should be in register R0 when the program is 
paused at done? Again, we would recommend a careful and active study of this sample 
program using the debugger if you have access to an actual Unix or OpenVMS 
programming environment. Monitor the contents of registers R0 and R1 as you step 
through the sequence of instructions down to the line with the label done. Be attentive 
to the two’s complement arithmetic operations. You should see some alternations of 
algebraic sign in the values in these registers from step to step. 
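
If no Alpha system is at hand, the register contents can be predicted with a short Python model and then compared with your own hand calculation. This is merely a sketch under our own assumptions: the names r0, r1, trace, and to_signed are hypothetical, and the 64-bit registers are imitated by masking every result.

```python
MASK64 = (1 << 64) - 1


def to_signed(value):
    """Interpret a 64-bit register pattern as a two's complement integer."""
    return value - (1 << 64) if value >> 63 else value


V1 = [-1, 3, 5]
V2 = [-2, -4, 6]

r0 = 0                               # like addq r31,r31,r0
trace = []
for a, b in zip(V1, V2):
    r1 = (a * b) & MASK64            # mulq: product of corresponding components
    r0 = (r0 + r1) & MASK64          # addq: update the running sum
    trace.append((to_signed(r1), to_signed(r0)))

for step, (r1_value, r0_value) in enumerate(trace, start=1):
    print(f"step {step}: R1 = {r1_value}, R0 = {r0_value}")
print(f"R0 at done (hex) = {r0:#x}")
```

Running the model shows the signed values of R1 and R0 after each multiply-and-add pair, which is where the changes of algebraic sign appear.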


Addressing Modes in Other Architectures 


RISC architectures typically support about the same number of addressing modes as the 
Alpha. CISC architectures, however, have almost invariably offered several other 
addressing modes. We will briefly describe a sampling of those modes in this section 
before concluding this chapter with the scaled arithmetic instructions on the Alpha, 
which can be used in multi-statement addressing sequences equivalent to some of the 
single-statement CISC modes. 

The earlier VAX and PDP-11 members of an extended family of computer archi- 
tectures again provide a useful context for discussion. These architectures have 
included three variants of register indirect addressing, which have assembler syntax and 
function as follows: 


Register Indirect   (Rx)    Effective address = contents of Rx 
Autoincrement       (Rx)+   Effective address = contents of Rx 
                            Then Rx is advanced to the next information unit 
Autodecrement       -(Rx)   Rx is moved back to the previous information unit 
                            Then effective address = decremented contents of Rx 


The computer can automatically adjust the register by an amount (1, 2, 4, or 8 bytes) 
corresponding to the size of the information unit ultimately accessed (byte, word, 
longword, quadword), because that size is implicit in the opcode of the instruction. 
Autoincrement mode is quite prevalent among computer architectures, autodecrement 
somewhat less so. Collectively these modes provide a methodical way of accessing 
elements in array storage for high-level language applications. 

Taken in combination, autodecrement and autoincrement mode facilitate the 
implementation of a last-in first-out (LIFO) stack in memory. A program may manage 
several stacks, using different registers as stack pointers. Commonly the principal stack, 
used for preserving and restoring context information in connection with procedure 
calls, uses a register named SP. Stack operations proceed schematically as follows: 


        mov     data,-(SP)      ; Push a unit of data onto stack 
                                ; Other instructions (block A) 
        opcode  (SP)            ; Access top datum on stack 
                                ; Other instructions (block B) 
        mov     (SP)+,data      ; Pop a unit of data off the stack 


Proper programming technique permits any sequence of a program (block A, block B) 
to use stack storage, but each sequence must pop exactly as many items as it has 
pushed. This takes on critical importance in system-level programming when asynchro- 
nous external events such as hardware interrupts can initiate routines that must begin by 
storing the context of the temporarily interrupted process on the system stack. 
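
The push and pop discipline can be modeled in a few lines of Python. This sketch is our own illustration of the two addressing modes, not code for any particular machine; the class name Stack, the starting address, and the 8-byte unit size are assumptions.

```python
class Stack:
    """A downward-growing LIFO stack driven by a stack pointer."""

    def __init__(self, top, unit=8):
        self.memory = {}     # address -> stored datum
        self.sp = top        # stack pointer register
        self.unit = unit     # size of one information unit in bytes

    def push(self, data):    # mov data,-(SP): decrement first, then store
        self.sp -= self.unit
        self.memory[self.sp] = data

    def top(self):           # opcode (SP): access without moving SP
        return self.memory[self.sp]

    def pop(self):           # mov (SP)+,data: fetch first, then increment
        data = self.memory[self.sp]
        self.sp += self.unit
        return data


stack = Stack(top=0x1000)
stack.push(11)
stack.push(22)
print(stack.top())                  # 22
print(stack.pop(), stack.pop())     # 22 11
print(hex(stack.sp))                # 0x1000: pushes and pops are balanced
```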

Let us return to the progression from literal addressing to register direct address- 
ing to register indirect addressing. From the vantage point of the central processing unit, 
there is progressively more addressing effort and potentially more delay in getting to 
the real data. This progression could be extended one more degree beyond register indi- 
rect addressing, with the first information unit from memory containing not the real 
data, but rather another pointer (effective address) leading to the data. This mode has 
been called register indirect deferred addressing. Both the VAX and the PDP-11 offer a 
version of such an addressing mode called autoincrement deferred, with the assembler 
syntax @(Rx)+. The contents of the register represent the address of the memory unit 
which, in turn, contains the effective address of a specified datum. After serving as 
pointer-to-the-pointer, the register is automatically advanced by one word (PDP-11) or 
one longword (VAX). This mode is useful for algorithms that involve lists of address 
pointers, such as tag sorts, or that involve passing arguments to external procedures by 
address. The PDP-11, but not the VAX, also has an autodecrement deferred addressing 
mode, with the assembler syntax @-(Rx). 

Many architectures conveniently offer a form of direct addressing (symbolic 
addressing) that is not true memory direct addressing, but is actually implemented as 
displacement addressing using the program counter (PC) as the base address: 


opcode symbol is equivalent to opcode disp(PC) 

where the assembler and linker ensure that the displacement measures the addressing
“distance” between this instruction and the datum at the location denoted by symbol.
This is also called relative addressing because, after linking, all data locations can be 
located by displacements relative to the program counter. This would not work for the 
Alpha because of the choice of only a 16-bit displacement field for Alpha load and store 
instructions. A few other RISC architectures, and some CISC architectures, implement 
not only this addressing mode but also a mode called relative addressing deferred: 


opcode @symbol is equivalent to opcode @disp(PC) 


These two modes are very convenient to the programmer because of the syntactical 
resemblance to manipulations of variables in high-level languages. 

This section has not covered the topic of addressing modes exhaustively, and 
almost every CISC architecture seems to have one or more modes unique to itself. Table 
4.4 summarizes the occurrence of the modes we have discussed for a classic 16-bit 
design (PDP-11), for two 32-bit CISC designs (VAX and Motorola® 680x0), and for 
the Alpha. Clearly the load/store RISC architecture of the latter has a drastically simpli- 
fied set of addressing modes. 


Table 4.4 Summary of Addressing Modes for Various Architectures


Mode                     Syntax        PDP-11   VAX    680x0   Alpha
Literal                  lit or #lit   Yes*     Yes    Yes     Integer operate instructions only
Register direct          Rx            Yes      Yes    Yes     Yes
Register indirect        (Rx)          Yes      Yes    Yes     Special case of displacement mode
Displacement             disp(Rx)      Yes      Yes    Yes     Load/store and jump instructions only
Displacement deferred    @disp(Rx)     Yes      Yes    No      No
Autoincrement            (Rx)+         Yes      Yes    Yes     No
Autodecrement            -(Rx)         Yes      Yes    Yes     No
Autoincrement deferred   @(Rx)+        Yes      Yes    No      No
Autodecrement deferred   @-(Rx)        Yes      No     No      No
Memory direct            addr          Yes*     Yes*   Yes*    See displacement mode
Memory indirect          @addr         Yes*     Yes*   No      No
Other modes                            Yes      Yes    Yes     No


* Actually implemented using the PC and one of the other modes.

Short sequences of Alpha instructions can readily emulate some of the complicated
modes of addressing offered by CISC architectures. For example, autoincrement
and autoincrement deferred modes can be accomplished as follows:


(Rpointer)+     ldq     Rdest,(Rpointer)
                lda     Rpointer,8(Rpointer)


@(Rpointer)+    ldq     Rdest,(Rpointer)
                ldq     Rdest,(Rdest)
                lda     Rpointer,8(Rpointer)


where Rpointer is the selected index register and Rdest can serve to receive the 
data value and (in the second Alpha illustration) to hold the intermediate address 
pointer. Actually, the sequence of simpler Alpha instructions quite nicely annotates 
what substeps would be involved during an execution cycle for the operand processing 
indicated by the CISC assembler syntax. 
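
To make those substeps concrete, here is a minimal Python sketch of the two emulation sequences. Registers and memory are plain dictionaries, and every address and data value is invented for the illustration; only the order of the steps is taken from the instruction sequences above.

```python
def autoincrement(regs, mem, rdest, rpointer):
    """Model of (Rpointer)+ built from two Alpha-style steps."""
    regs[rdest] = mem[regs[rpointer]]   # ldq Rdest,(Rpointer)
    regs[rpointer] += 8                 # lda Rpointer,8(Rpointer)


def autoincrement_deferred(regs, mem, rdest, rpointer):
    """Model of @(Rpointer)+ built from three Alpha-style steps."""
    regs[rdest] = mem[regs[rpointer]]   # ldq Rdest,(Rpointer): fetch the pointer
    regs[rdest] = mem[regs[rdest]]      # ldq Rdest,(Rdest): fetch the datum
    regs[rpointer] += 8                 # lda Rpointer,8(Rpointer)


# Hypothetical memory image: a two-entry pointer table at 0x1000
# whose entries lead to data at 0x2000 and 0x2008.
mem = {0x1000: 0x2000, 0x1008: 0x2008, 0x2000: 111, 0x2008: 222}

regs_direct = {"Rdest": 0, "Rpointer": 0x2000}
autoincrement(regs_direct, mem, "Rdest", "Rpointer")

regs_deferred = {"Rdest": 0, "Rpointer": 0x1000}
autoincrement_deferred(regs_deferred, mem, "Rdest", "Rpointer")
first_datum = regs_deferred["Rdest"]
autoincrement_deferred(regs_deferred, mem, "Rdest", "Rpointer")
second_datum = regs_deferred["Rdest"]
```

In the deferred case the pointer register walks through the table of addresses, not through the data themselves.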

Thus a reduced instruction set need not be seen as inadequate in comparison to 
other architectures. Indeed, a RISC computer has a performance advantage if it can exe- 
cute such sequences of simpler instructions in a shorter overall time than is required for 
the corresponding single instruction for a CISC computer. 


Scaled Addition and Subtraction Instructions 


The Alpha has additional forms of the integer addition and subtraction instructions, 
called scaled addition and subtraction (Table 4.1), which nondestructively multiply the 
value in register Ra by either 4 or 8 before that first operand is used in the calculation: 


        sxaddt  Ra,Rb,Rc        ; Rc <-- x * Ra + Rb
        sxaddt  Ra,lit,Rc       ; Rc <-- x * Ra + lit
        sxsubt  Ra,Rb,Rc        ; Rc <-- x * Ra - Rb
        sxsubt  Ra,lit,Rc       ; Rc <-- x * Ra - lit


where the multiplier is x = 4 or 8 and the type of information unit is t = l (longword)
or t = q (quadword). For t = l, the high-order 32 bits of registers Ra and Rb are
ignored and the result to be stored in register Rc is computed using sign-extended
versions of the low-order 32 bits in the addition or subtraction.

One of these forms of scaled arithmetic instruction has an application in comput- 
ing the address of a given element of an array: 


address = (size of information unit) * (element number) + (address of array origin)


        sxaddt  Relement,Rorigin,Raddress
        ldt     Rdest,(Raddress)


where x = 4 and t = l for an array of longword data or x = 8 and t = q for an array of
quadword data, Relement contains the index value of the element (numbered
upwards from zero, not one), Rorigin contains the address of the zeroth element, and
Raddress will contain the address of the particular element sought.
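
The address formula can be checked with a small Python model of the scaled addition. The function name scaled_add and the array origin 0x20000 are assumptions made for this sketch, and the longword form follows our reading of the description above (a 32-bit result, sign-extended to 64 bits).

```python
MASK64 = (1 << 64) - 1


def sext32(value):
    """Sign-extend the low-order 32 bits of value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def scaled_add(x, t, ra, rb):
    """Model of sxaddt: x is 4 or 8, t is 'l' (longword) or 'q' (quadword)."""
    if x not in (4, 8) or t not in ("l", "q"):
        raise ValueError("x must be 4 or 8 and t must be 'l' or 'q'")
    total = x * ra + rb
    if t == "l":
        total = sext32(total)   # only the low-order 32 bits take part
    return total & MASK64       # value as held in a 64-bit register


ORIGIN = 0x20000                # hypothetical address of element zero
element = 3                     # index counted upward from zero

longword_address = scaled_add(4, "l", element, ORIGIN)
quadword_address = scaled_add(8, "q", element, ORIGIN)
```

Element 3 lies 12 bytes past the origin in a longword array and 24 bytes past it in a quadword array.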

The Unix assembler may substitute short sequences of scaled addition and/or sub- 
traction instructions when a multiplication would otherwise contain a literal. For exam- 
ple, if we had written the RADIX program using 


RAD = 10
mul10:  mulq    $1,RAD,$0


the assembler would have actually produced a program containing the sequence 


mul10:  s4addq  $1,$1,$0        # R0 = 4*R1 + R1 = 5*R1
        addq    $0,$0,$0        # R0 = 5*R1 + 5*R1 = 10*R1


where we have added comments to show why this sequence yields mathematically 
identical results. We consider in Chapter 11 whether such substitutions by assemblers, 
compilers, or programmers are optimal in terms of machine performance. 
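
The equivalence claimed in those comments holds for every 64-bit pattern, because both routes compute 10*R1 modulo 2^64. A brief Python sketch, with helper names of our own choosing, compares the two routes on a handful of values:

```python
MASK64 = (1 << 64) - 1


def s4addq(ra, rb):
    return (4 * ra + rb) & MASK64   # Rc <- 4*Ra + Rb, truncated to 64 bits


def addq(ra, rb):
    return (ra + rb) & MASK64


def mulq(ra, rb):
    return (ra * rb) & MASK64       # low-order 64 bits of the product


def times_ten(r1):
    r0 = s4addq(r1, r1)             # R0 = 5*R1
    r0 = addq(r0, r0)               # R0 = 10*R1
    return r0


# Includes zero, small values, and patterns that overflow 64 bits.
samples = [0, 1, 7, 12345, 1 << 62, 1 << 63, MASK64]
mismatches = [v for v in samples if times_ten(v) != mulq(v, 10)]
```

The list mismatches stays empty, so the substituted sequence and the multiplication agree even when the product overflows.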


Summary 


We have presented in this chapter general layouts for the principal classes of Alpha 
machine instructions. We have discussed and illustrated some of the uses of two groups 
of Alpha instructions, namely, the integer arithmetic operations and the load address 
and load/store data operations. We have also outlined the most important addressing 
modes used in other contemporary computer architectures, with an emphasis on the role 
of registers. 


EXERCISES


4.1 Find the hexadecimal values of all the unique opcodes in SQUARES2 (Figure 3.2).


4.2 Use the debugger on SQUARES2 (Figure 3.2) to deduce which register is “Ra” for
each unique instruction in that program. Do you see any consistent pattern with
respect to the order in which operands are written in Alpha assembly language?


4.3 With literal addressing for the Alpha, the embedded constant value is a small positive
number (0-255). Explain how to achieve the net effect of a negative literal value for
add and sub instructions.


4.4 Suppose that a computer architecture with a 16-bit instruction word allocates a 4-bit
field for opcodes and two 6-bit fields for operands.

a. What is the maximum number of opcodes?

b. What is the maximum number of locations that can be accessed by direct
addressing?


4.5 What bit in an Alpha instruction of the operate class appears to be associated with the
/v qualifier?


4.6 What bit in an Alpha instruction of the operate class appears to be associated with
scaled add and sub instructions?


4.7 Show that overflow arises when 3 is subtracted from minus 2 in a 3-bit two’s
complement representation.


4.8 Demonstrate that the algorithm for obtaining bits <127:64> of the product of two
signed 64-bit integers is

Result of umulh - Rb * (bit 63 of Ra) - Ra * (bit 63 of Rb)

as stated in the text of this chapter.


4.9 Find out what the assembler that you use (Unix or OpenVMS) would do if the plus
sign in front of RAD at the line labeled first in the RADIX program were
removed. Also find out whether parentheses (Unix) or angle brackets (OpenVMS)
around RAD would have the intended effect.


4.10 What is the hexadecimal value in R2 after these instructions execute? 


a. lda R2,-1(R31)
b. ldah R2,-1(R31)


4.11 Describe the ranges of constants that can be specified using lda and ldah
instructions with register R31.


4.12 Reconsider the illustration for setting up multiple base registers using ldah
instructions for a large data region, BigData. Explain why it would be better to set up R15
to an address of BigData+32768 before setting up R10, R11, R12... to integer
multiples of 64K beyond the value contained in register R15.


4.13 Suppose that quadword addresses 30000 and 30008 contain hexadecimal values 
01234567 and 89ABCDEF, respectively. What is the value loaded into Ra by the fol- 
lowing instructions? 


a. ldl Ra,effective address 30000
b. ldl Ra,effective address 3000C
c. ldq_u Ra,effective address 30001
d. ldq_u Ra,effective address 3000F


4.14 Suppose that quadword addresses 30000 and 30008 initially contain hexadecimal
values FEDCBA98 and 76543210 and that register Ra contains the number -1.
Describe the final contents at addresses 30000 and 30008 after the following instruc- 
tions have executed: 


a. stl Ra,effective address 30000
b. stl Ra,effective address 3000C
c. stq_u Ra,effective address 30001
d. stq_u Ra,effective address 3000F


4.15 Suggest why the register in the autoincrement deferred addressing mode advances by 
2 in the PDP-11 but by 4 in the VAX version. 


4.16 Schematically write out an Alpha instruction sequence that would be equivalent to 
the processing in a CISC architecture for a source operand specified as follows: 


a. -(Rx)
b. @-(Rx)


4.17 Write and test an Alpha program to subtract two 4-element vectors.


4.18 Explain how one could use the scaled addition or subtraction instructions to good 
advantage for part of the addressing calculations in an application that requires sum- 
ming the first 10 elements of a vector of longwords. 


4.19 Enumerate the multiples of N which can be computed using only one scaled arith- 
metic instruction with the same register Rx in the three operand positions. 




4.20 Show two different Alpha instructions that will negate the two’s complement integer 
in register R3, placing the result back into that same register. 


4.21 (Strongly encouraged) Begin using a word processor table or spreadsheet chart to 
make your own composite index of Alpha instructions, opcodes, and function codes 
starting with all of those introduced in this chapter. We have found that persons gain 
a better familiarity with such material if they transcribe the information themselves 
in their own fashion than if they merely find it in the back of a book. Give consider- 
ation to maintaining your lists using at least two orderings, one sorted by the instruc- 
tion mnemonics and another sorted by opcode (primarily) and function code 
(secondarily). 





CHAPTER 5 


Branch Instructions 
and Logical Operations 


In previous chapters we have developed some fundamental topics
in a study of computer architecture and have illustrated them with brief
Alpha assembly language programs. Yet all of the programs presented thus far have 
been unrepresentative of actual programs in one key respect, namely, that only straight- 
through sequences of instructions have been shown. Realistic programs also contain 
data-dependent decisions that alter the flow of control. 


In a high-level language, control structures may include several varieties of 
loops, case choices, and the if...then...else construct. In this and later chapters we
are going to see how such powerful constructs for designing and imparting orderli- 
ness to computer programs can be built up from very small beginnings, using the 
machine-language branch instructions in combination with certain arithmetic com- 
parisons and logical operations. 


A warning to the pure of heart: For several decades, important advancements in 
the methodology of programming in high-level languages have led to the modern canon 
of a near-total avoidance of the loose cannon of an older programming technique, the 
goto statement of FORTRAN and of early implementations of BASIC. Newer lan- 
guages have even been developed which scarcely include any direct equivalent of 
goto. In actuality, compilers for all high-level languages implement their high-level 
control structures using low-level variants of goto, that is, the branch instructions in 
machine language. Appreciating this paradox, perhaps with amusement but definitely 
with thorough understanding, will be a major take-away lesson from your study of 
assembly language programming. 


Types of Control Instructions


Control of logical flow based upon currently calculated conditions gives computer pro- 
grams their extraordinary power and versatility. Procedure calls and subroutine calls 
are important forms of control because they facilitate organizing a program into coher- 
ent modules, but they are perhaps not as essential or as fundamental as the branch 
instructions. A running computer program makes “decisions” by comparing a current 
result against a fixed standard or another computed quantity using some relationship of 
logic, such as equality or greater-than. The machine must have some type of instruction 
that can bring about a different course of program flow for the two opposite outcomes 
of the implicit test, such as less-than-or-equal-to versus greater-than. 

Some now-historical architectures had skip instructions that would increment the 
program counter (PC) either normally or by an extra amount, depending on the test. 
One architecture that was contemporary with the development of the FORTRAN lan- 
guage had a three-way test instruction with space in the instruction word for three 
address values, one of which would be put into the program counter when the tested 
value was negative, zero, or positive. This hardware feature found its way into software, 
in the form of a three-way version of the if statement in the language specification. 

Most commonly, though, computer architectures include a set of branch instruc- 
tions, almost all of which are dichotomous. Depending on which way the test comes 
out, either the branch is “taken” or it “falls through.” If the branch is taken, the program 
counter must be altered, and this is usually accomplished by adding a signed offset con- 
tained within the instruction word. If the branch falls through, the already-updated 
value in the program counter is allowed to control the memory subsystem for fetching 
the next instruction in line, i.e., the instruction that immediately follows the branch 
instruction itself. 


Alpha Branch Instructions 


The format of the Alpha branch instructions (Figure 4.1) includes the standard 6-bit 
field for the opcode, a 5-bit field for the register name (number) that contains the value 
to be tested, and a 21-bit field for the branch displacement. Eight conditional branch 
instructions express various tests based on the value in integer register Ra (Table 5.1). 
We also consider in this section the unconditional branch instruction. We take up in 
later chapters certain other conditional branch instructions that express tests based on a 
floating-point register and certain special branch-format instructions used in conjunc- 
tion with subroutine calls and returns. 


Table 5.1 Alpha Integer Branch Instructions


Mnemonic   Opcode   Purpose
beq        39       Branch if contents of register Ra is equal to zero
bne        3D       Branch if contents of register Ra is not equal to zero
blt        3A       Branch if contents of register Ra is less than zero
bge        3E       Branch if contents of register Ra is greater than or equal to zero
ble        3B       Branch if contents of register Ra is less than or equal to zero
bgt        3F       Branch if contents of register Ra is greater than zero
blbc       38       Branch if bit 0 of register Ra is clear (0)
blbs       3C       Branch if bit 0 of register Ra is set (1)
br         30       Branch unconditionally


Consider the following schematic example containing forward and backward 
branches in a short section of program: 


again:  opA
        bxx     Ry,onward       ; Conditional forward branch
more:   opB
        br      Rz,again        ; Unconditional backward branch
onward:


Assume for the sake of this present discussion that opA and opB are single instructions, 
though more realistically they could be sequences of many instructions. 


In order to support a wide branch range efficiently, the designers of the Alpha 
decided that the 21-bit displacement field would not explicitly store the lowest two bits 
of the actual address displacement. The CPU hardware multiplies the apparent displace- 
ment encoded in this field by 4 and then adds the resultant scaled value to the current 
value in the program counter (PC). 


When the bxx instruction is executing, the program counter has already been 
incremented to point to the instruction at label more. Suppose first that the bxx 
instruction is to be taken. If so, this branch instruction needs to advance the program 
counter (PC) by 2 longwords (8 address units). In other words, the assembler should 
encode the displacement as 2 (0 0000 0000 0000 0000 0010) in the bxx instruction. 
During the execution cycle for the bxx instruction, the number 8 will be added to the 
already updated value in the PC. 
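
The same arithmetic applies to the backward branch in the schematic example. The following Python sketch encodes and decodes the 21-bit displacement field as described above; the function names and the starting address 0x1000 for label again are hypothetical, chosen only so that the numbers can be followed by hand.

```python
FIELD_BITS = 21


def encode_displacement(branch_address, target_address):
    """Return the 21-bit field the assembler would store for this branch."""
    updated_pc = branch_address + 4          # PC already points past the branch
    byte_offset = target_address - updated_pc
    if byte_offset % 4 != 0:
        raise ValueError("branch targets must be longword aligned")
    displacement = byte_offset // 4          # the lowest two bits are not stored
    limit = 1 << (FIELD_BITS - 1)
    if not -limit <= displacement < limit:
        raise ValueError("target is out of branch range")
    return displacement & ((1 << FIELD_BITS) - 1)


def branch_target(branch_address, field):
    """Recover the target address the CPU computes from the stored field."""
    if field & (1 << (FIELD_BITS - 1)):      # the sign bit of the field is set
        field -= 1 << FIELD_BITS
    return branch_address + 4 + 4 * field


# Assumed layout: one instruction per label, starting at 0x1000.
AGAIN, BXX, MORE, BR, ONWARD = 0x1000, 0x1004, 0x1008, 0x100C, 0x1010

forward_field = encode_displacement(BXX, ONWARD)   # bxx Ry,onward
backward_field = encode_displacement(BR, AGAIN)    # br Rz,again
```

Under these assumptions the forward branch stores 2, as computed in the text, while the backward branch stores the two's complement pattern for -4, since the target lies four instructions before the updated PC.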


Conditional Branch Instructions


The conditional branch instructions occur in four sets of logical opposites. Three sets 
(beq/bne, blt/bge, ble/bgt) express a signed arithmetic comparison of the 
value in register Ra to the number zero, based on the two’s complement convention. For 
instance if register R2 contains the number 3, the instructions bne, bge, and bgt 
would be taken but the instructions beq, blt, and ble would fall through. 

In some contexts an integer represents not an arithmetic value, but rather a Bool- 
ean true/false condition. One convention assigns zero (all bits zero) to false and minus 
one (all bits one) to true. Then a beq instruction would be taken when the tested value 
is false and bne when it is true. Since true/false can actually be represented using only 
one bit, not as many as 64, potential confusion arises from having very many bit pat- 
terns for either true or false. 

The final set of conditional branch instructions (blbc/blbs) tests only the least
significant bit of register Ra. Similar instructions also exist in the VAX but not the
PDP-11. In the OpenVMS programming environment, these instructions are frequently used
to test a status code returned by procedures in register R0, with the convention that odd
codes represent success and even codes represent failure or errors. The lowest-order bit,
of course, distinguishes even numbers (bit 0 is 0) from odd numbers (bit 0 is 1).
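
The eight tests can be summarized as predicates on the 64-bit contents of register Ra. The Python sketch below is our own tabulation of Table 5.1 (the dictionary BRANCH_TESTS and the helper names are not part of any assembler); it reproduces the example with the value 3 and also tries the all-ones pattern for true.

```python
MASK64 = (1 << 64) - 1


def signed64(value):
    """Interpret a 64-bit pattern as a two's complement integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


BRANCH_TESTS = {
    "beq": lambda ra: signed64(ra) == 0,
    "bne": lambda ra: signed64(ra) != 0,
    "blt": lambda ra: signed64(ra) < 0,
    "bge": lambda ra: signed64(ra) >= 0,
    "ble": lambda ra: signed64(ra) <= 0,
    "bgt": lambda ra: signed64(ra) > 0,
    "blbc": lambda ra: (ra & 1) == 0,   # tests only bit 0
    "blbs": lambda ra: (ra & 1) == 1,   # tests only bit 0
}


def taken(ra):
    """List the branch mnemonics that would be taken for this value of Ra."""
    return [name for name, test in BRANCH_TESTS.items() if test(ra)]


taken_for_3 = taken(3)
taken_for_true = taken(MASK64)   # all bits one, i.e., minus one
taken_for_false = taken(0)
```

Exactly one member of each pair of logical opposites is taken for any given value, which is why each list has four entries.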


Integer Compare Instructions 


The Alpha provides a set of integer signed and unsigned compare instructions (Table
5.2) that complement the blbs/blbc conditional branch instructions. These
instructions share opcode 10 with the add and sub instructions (see also Table 4.1).


Table 5.2 Alpha Integer Compare Instructions


                    Function
Mnemonic   Opcode   Code       Purpose
cmpeq      10       2D         Compare signed quadword equal
cmplt      10       4D         Compare signed quadword less than
cmple      10       6D         Compare signed quadword less than or equal
cmpult     10       1D         Compare unsigned quadword less than
cmpule     10       3D         Compare unsigned quadword less than or equal

The integer signed compare instructions, like the integer arithmetic instructions,
use registers Ra and Rb as source operands and register Rc as the destination operand. 
They also offer the alternative of an 8-bit unsigned literal instead of a value in register 
Rb for one source operand: 


        cmpxx   Ra,Rb,Rc        ; Rc <- 1 if "Ra xx Rb" is true
                                ; else Rc <- 0
        cmpxx   Ra,lit,Rc       ; Rc <- 1 if "Ra xx lit" is true
                                ; else Rc <- 0


The destination register Rc is set to 1 if the Boolean comparison “Ra xx Rb” (or “Ra xx 
lit”) is true and is set to 0 if that condition is false. 

The designers of the Alpha architecture recognized that not all imaginable com- 
pare instructions are actually needed. Note that “compare less than P,Q” is the same as 
“compare greater than Q,P” and that “compare less than or equal P,Q” is the same as 
“compare greater than or equal Q,P.” As we might expect from the parsimony of a RISC 
design, the possibility of redundancy was avoided here for the case of two source oper- 
ands in registers. But for the case of a literal as one source operand, there is no such 
symmetry supported by the machine instructions. The GT and GE conditions are not 
available because the literal can only occur in place of Rb, not Ra. 

The integer unsigned compare instructions, like the integer arithmetic instructions, 
use registers Ra and Rb as source operands and register Rc as the destination operand. 
They also offer the alternative of an 8-bit unsigned literal instead of a value in register 
Rb for one source operand: 


        cmpuxx  Ra,Rb,Rc        ; Rc <- 1 if "Ra xx Rb" is true
                                ; else Rc <- 0
        cmpuxx  Ra,lit,Rc       ; Rc <- 1 if "Ra xx lit" is true
                                ; else Rc <- 0


The destination register Rc is set to 1 if the Boolean comparison “Ra xx Rb” (or “Ra xx 
lit”) is true and is set to 0 if that condition is false. Note that “compare less than P,Q” is 
the same as “compare greater than Q,P” and that “compare less than or equal P,Q” is the 
same as “compare greater than or equal Q,P.” 

The preceding paragraph for cmpu instructions reads almost like a photocopy of 
the earlier paragraph for cmp instructions. Why are there these two sets of integer com- 
pare instructions for the Alpha? The answer harks back to the discussion of signed and 
unsigned integers in Chapter 1. If both operands P and Q “look positive” there is no dif- 
ference in results from cmp and cmpu instructions. But if either P or Q “looks nega- 
tive” we need to know whether those bit patterns represent signed quantities (such as 
numbers) or unsigned quantities (such as addresses). That is, the quadword
89ABCDEF01234567 is greater than 0123456789ABCDEF if these represent 
addresses, but 0123456789ABCDEF is greater than 89ABCDEF01234567 if these rep- 
resent signed numbers. 
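
A few lines of Python make the distinction tangible for the two quadwords just mentioned. The functions cmplt and cmpult below are simplified stand-ins for the Alpha instructions of the same names, returning the 1 or 0 that would be placed in Rc.

```python
MASK64 = (1 << 64) - 1


def signed64(value):
    """Interpret a 64-bit pattern as a two's complement integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def cmplt(ra, rb):
    """Signed compare: 1 if Ra < Rb as two's complement numbers, else 0."""
    return int(signed64(ra) < signed64(rb))


def cmpult(ra, rb):
    """Unsigned compare: 1 if Ra < Rb as unsigned quantities, else 0."""
    return int((ra & MASK64) < (rb & MASK64))


P = 0x89ABCDEF01234567   # "looks negative": bit 63 is set
Q = 0x0123456789ABCDEF   # "looks positive"

unsigned_q_below_p = cmpult(Q, P)   # as addresses, Q lies below P
signed_q_below_p = cmplt(Q, P)      # as signed numbers, Q is not below P
signed_p_below_q = cmplt(P, Q)      # as signed numbers, P is negative
```

The two forms disagree only when at least one operand has bit 63 set.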

Always use the unsigned forms of these instructions for comparing addresses. In
the analogous situation with unsigned versus signed branch instructions, many a
PDP-11 program used to go astray as soon as the machine was fitted with more than 32KB of
memory! Some inattentive programmers had compared addresses using signed instead 
of unsigned arithmetic, and program flow would go in the wrong direction after the 
memory upgrade. Although it took years longer for this same problem to surface with 
VAX systems, some VAX programs likewise contained latent bugs that really bit when 
affordable amounts of memory reached the halfway point in the numeric representation 
for the address space. 


Unconditional Branch Instruction 


The unconditional branch instruction shares the same layout (Figure 4.1) as the condi- 
tional branches, even though nothing will be tested. Nevertheless, a register must be 
specified. The unconditional branch instruction stores the updated PC into that register. 
This somewhat surprising behavior is related to that of the instruction for transferring 
control to a subroutine, which we will not take up until Chapter 7. In order to avoid 
destroying any useful register contents, we will select register R31 (permanently zero) 
for use with the br instruction. 

The unconditional branch is indeed the machine language goto instruction. Let’s 
think about implementing an if...then...else construct. How is the computer going
to skip over the entire else sequence of instructions when an if relationship has 
directed the then sequence of instructions to be executed? Consider this schematic 
program fragment: 


if:     cmpeq   Rx,value,Rz     ; These two lines
        blbc    Rz,else         ;  comprise IF control
then:   sequence for THEN       ; Do when Rz = 1
        br      R31,endif
else:   sequence for ELSE       ; Do when Rz = 0
endif:


This machine language pseudocode fragment shows, we hope, that a goto almost inev- 
itably lurks inside one of the control structures which is extensively used in high-level 
languages as one means of avoiding any gotos! 

This is a pretty good illustration of the power of abstraction in computer science,
where certain details can be buried at a deeper level, with the result that the rules are 
then much cleaner and less error-prone at the higher level of abstraction. In the case of 
high-level languages, the compiler handles the deep-down drudgery without error. 
When we are programming in assembly language ourselves, however, we have to be 
careful indeed. With the pseudocode example just given, as with a real 
if...then...else construct, we must ensure that there is no way to jump from afar
either into the sequence of instructions for then or into the sequence of instructions for 
else. And that is really what is ordinarily meant by strongly discouraging or “forbid- 
ding” any goto instructions. 


Branch Addressing Range 


The Alpha branch instructions have a rather generous forward and reverse range of four 
megabytes, or one million instructions in each direction (recall that the displacement is 
encoded in a signed 21-bit field). In contrast, both the PDP-11 and the VAX used only 
one byte (8 bits) for a signed displacement for branches. On the PDP-11, the displace- 
ment was construed as a word-length offset, since instructions always consisted of 1, 2, 
or 3 word-aligned information units. This design resulted in a range of -128 to +127 
words, or perhaps 70 instructions on average, forward or backward. On the VAX, the 
one-byte signed displacement is construed as a byte-length offset, since VAX instruc- 
tions consist of variable numbers of bytes in a stream. Typical VAX instructions average 
about four bytes; therefore, the branch range is constricted to perhaps 35 instructions on 
average, forward or backward. Some programmers have felt that this very small range 
carried the concept of enforced locality to an awkward extreme. 
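
The raw field ranges quoted in this section follow directly from the widths of the displacement fields. As a small check, the Python sketch below computes them; the function signed_range is our own helper, and the averages of 70 and 35 instructions are the text's estimates, which are not recomputed here.

```python
def signed_range(bits):
    """Smallest and largest values of a two's complement field of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


# Alpha: 21-bit field counted in longword (4-byte) instructions.
alpha_back, alpha_forward = signed_range(21)
alpha_bytes_back = -alpha_back * 4

# PDP-11 and VAX: 8-bit field, counted in words and in bytes respectively.
pdp11_words = signed_range(8)
vax_bytes = signed_range(8)
```

The Alpha figure comes to 4 x 2^20 bytes, the four megabytes cited above, against at most 128 words on the PDP-11 and 128 bytes on the VAX.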


DOT_N: Using a Conditional Branch for Loop Control 


In the previous chapter, we presented a program without a loop that computed the dot
product of two 3-vectors. A similar program, but with greater generality, would
compute the dot product of two N-vectors. Such a program, with the dimensionality N as a
symbolic parameter at the top of the listing, is presented in Figure 5.1. From this point
forward, we are going to print in this book only one version (usually Unix) of each sam- 
ple program unless another version (usually OpenVMS) is especially illuminating or 
has to be substantially different. All versions of the programs can be found on the 
accompanying CD-ROM. 

/* DOT_N  Scalar product of N-vectors (Unix) */
/* This program computes the scalar ("inner", "dot")
   product of two N-component vectors V1, V2. */
N = 3                           # Dimensionality <= 255 (why?)
        .data
V1:     .quad   -1,+3,+5        # 3-vector named V1
V2:     .quad   -2,-4,+6        # 3-vector named V2
        .text                   # Section for program code
        .align  4               # Octaword alignment
        .set    noreorder       # Disallow rearrangements
        .globl  main            # These three lines
        .ent    main            #  mark the mandatory
main:                           #  'main' program entry
        ldgp    $gp,0($27)      # Load the global pointer
        .frame  $sp,0,$26,0     # Describe the stack frame
        .prologue 1             # Say that $gp is in use
        .globl  first
first:  lda     $14,V1          # R14 -> vector V1
        lda     $15,V2          # R15 -> vector V2
        addq    $31,$31,$0      # R0 = running sum
        mov     (N),$3          # R3 = dimensionality
again:  ldq     $1,($14)        # R1 = component of V1
        ldq     $2,($15)        # R2 = component of V2
        mulq    $1,$2,$1        # R1 = V1*V2 (components) now
        addq    $1,$0,$0        # Update the sum
        addq    $14,8,$14       # R14 -> next V1 component
        addq    $15,8,$15       # R15 -> next V2 component
        subq    $3,1,$3         # Count down....
        bgt     $3,again        # ....until R3=0
done:   mov     0,$0            # Signal all is normal
        ret     $31,($26),1     # Back to Unix environment
        .end    main            # Mark end of procedure


Figure 5.1 DOT_N: An illustration of loop control using a conditional branch 


This more general program DOT_N uses additional registers: R3 for the dimen- 
sionality and loop control, R14 as an address pointer for vector V1, and R15 as an 
address pointer for vector V2. While there is a little more overhead between first 
and again to get everything set up, the heart of the algorithm is simplified because the 
multiply/add sequence occurs only once (inside the loop). Notice that both address 
pointers must be incremented by 8 units in order to advance to the next quadword val- 
ues each time through the loop, while the copy of the dimensionality (i.e., the number 
of components) must be decremented by one. We could have adjusted the addresses just
as well using instructions like lda r4,8(r4).

Parentheses are needed around N, or alternatively a unary plus operator in front of 
N, in the mov pseudo-instruction just above the label again in order to trigger the 
Unix assembler to commence building a machine instruction using literal addressing for 
the constant. Otherwise, the assembler prints a message that it was expecting a register 
as the source operand. 

As an alternative design, let us briefly consider using the s8addq instruction as a 
means of relating the index for addressing the components to the index for loop control, 
much as one could in using a for loop in C or similar techniques in other high-level 
languages. Making these two index quantities become the same requires recognizing 
that register R3 should now step downward from N-1 to 0 rather than from N to 1. We
also need to introduce two additional registers, R4 and R5, as intermediates and then 
rewrite the body of the loop: 


        mov     (N-1),$3        ; R3 = highest index value
again:  s8addq  $3,$14,$4       ; R4 -> component of V1
        s8addq  $3,$15,$5       ; R5 -> component of V2
        ldq     $1,($4)         ; R1 = component of V1
        ldq     $2,($5)         ; R2 = component of V2
        mulq    $1,$2,$1        ; R1 = V1*V2 (components) now
        addq    $1,$0,$0        ; Update the sum
        subq    $3,1,$3         ; Countdown....
        bge     $3,again        ; ....through R3=0 (the last)


In this method, the most natural way to integrate the loop counter with the array index 
has had the effect of reversing the order in which the vector elements are accessed. Of 
course, reversing the order of addition here is immaterial according to standard alge- 
braic rules. On the other hand, if we were printing an ordered list, we would not have 
such arbitrary freedom to modify the directionality of access to data. 

How can we assess whether this alternative, or the original in the DOT_N program 
(Figure 5.1), is preferable? The execution time of a program depends primarily on the 
number of instructions fetched, decoded, and executed. Secondarily, the differential 
execution time is a factor; for example, an instruction requiring access to memory for 
data may be slower in any architecture than an instruction requiring access only to reg- 
isters for data. From the label again up to the bgt or bge instruction, both versions 
have a loop body with seven instructions, two of which load data from memory into a 
register. In this instance the choice would seem to come down to personal preference. 
Yet the alternative just explored uses two more registers than does DOT_N. If other uses 
in an algorithm put forth a stronger claim for register allocation, the programmer might
well prefer the original approach that we presented in the DOT_N program. 
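
For readers who think more readily in a high-level language, the index-driven alternative can be sketched in C. This is only an illustration under our own assumptions: the function name dot_down and its parameter names do not appear in the DOT_N program, and the comments merely indicate which Alpha instruction each statement stands for.

```c
#include <stdint.h>

/* Hypothetical C rendering of the s8addq-based loop. One index i serves
   both as the loop counter and as the array index, stepping from n-1
   down to 0. Like the assembly fragment, it assumes n >= 1. */
int64_t dot_down(const int64_t *v1, const int64_t *v2, int64_t n)
{
    int64_t sum = 0;            /* R0 = running sum                    */
    int64_t i = n - 1;          /* R3 = highest index value            */
    do {
        int64_t a = v1[i];      /* s8addq + ldq: address = base + 8*i  */
        int64_t b = v2[i];      /* same for the second vector          */
        sum += a * b;           /* mulq, then addq                     */
        i -= 1;                 /* count down                          */
    } while (i >= 0);           /* bge: repeat through i = 0           */
    return sum;
}
```

Calling dot_down with the vectors (1,2,3) and (4,5,6) and n = 3 accumulates 18, then 10, then 4, which shows the reversed order of access described above.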


Locality and Symbolic Labels 


When we first introduced symbolic labels that refer to the addresses of data or instruc- 
tions, we pointed out the distinction between simply defining the label with a single 
colon and marking it as global by using the .globl directive (Unix) or defining it with
a double colon (OpenVMS). In either case, a symbol must not be multiply defined 
within a certain scope. 

If only a single colon is used, the symbol is local in scope and is “known” or “visible”
only within a single module being assembled. The symbol will not be passed
along to the linker through an intermediate object file, and thus cannot show up in a link 
map. Such symbolic names can therefore be reused with different meanings in different 
object modules that are linked together. At the same time, the assembler will detect and 
report an attempted multiple definition of a symbol within any one module. 

When a symbol is global in scope, it can be referenced in other modules outside 
the one where it is defined. The symbol is passed to the linker through an intermediate 
object file, and the linker can include it in the link map describing the final executable 
program. The linker detects and reports an error condition if a global symbol is defined
in more than one module. 

Long programs can require huge numbers of labels, primarily as branch targets. 
Since good programming practice emphasizes locality of such transfers of control, a 
branch target seldom needs to be “visible” at great distances. Moreover, it becomes quite 
tedious to concoct numerous different but largely synonymous labels (such as top, 
next, onward, again, loop) that would be unique throughout the whole module. 

An assembler may make available to the assembly language programmer a special 
type of local labels that are temporary. 

For the Unix assembler, a temporary local label is a small numeric value (1-255),
which can be referenced in a branch instruction by putting a letter f (forward)
or a letter b (backward) immediately after the final digit. The assembler looks for the 
nearest such label corresponding to the specified number in the forward or backward 
direction, respectively. 

For MACRO-64 (OpenVMS), temporary local labels have names that begin with
a numeric character and end with the dollar sign, in the range 0$ through 29999$ (the
range 30000$ through 65535$ is reserved for temporary labels automatically created
by macros, as mentioned in Chapter 9). Label names are the arbitrary choice of the
programmer. Temporary local labels need not be assigned in any particular numerical
order, but the preferred programming style uses increasing numerical values, e.g., 10$,
20$, 90$.

When temporary local labels are used, a greater burden falls onto the comment 
field to explain the algorithm and the logical flow of a routine. 

According to the MACRO-64 manual, the scope of a temporary local label is sup- 
posed to be bounded within a local label block by the nearest two ordinary local labels, 
as shown in this schematic fragment: 


top:    <instruction sequence A>
        blt     Rx,20$
10$:    <instruction sequence B>
        br      R31,out         ; Exit
20$:    <instruction sequence C>
        bgt     Ry,10$          ; Conditional exit
out:    <instruction sequence D>


The temporary local labels 10$ and 20$ should be reusable in the program both any- 
where preceding ordinary local label top and anywhere following out. That is, the use 
of temporary local labels should guarantee that those labels cannot be referenced out- 
side their local label block. In the illustration just given, no branch or jump instruction 
outside the block bounded by the local labels top and out should ever transfer control 
to the instruction at 20$ between them. Through such automatic restrictions, an assem- 
bler can exert a role of enforcing modular design when temporary local labels are 
adopted routinely. 

(We found and reported to Digital Equipment Corporation two flaws related to the 
handling of labels in MACRO-64 version 1.1-087, as described in the Suggested 
Resources section at the end of this book.) 


Loop Design and Program Efficiency 


Every instruction inside a loop in a program exacts a cost which is the intrinsic execu- 
tion time of the instruction multiplied by the average number of iterations through that 
particular loop over the whole duration of running of the program. Slicing even just one 
superfluous instruction out of a loop has a bigger favorable impact on performance than 
tinkering with sequences of instructions that are executed only once. When loops are 
nested, the instructions located within the innermost loop have greater impact than 
those located in outer loops. 

Branch instructions can have a special impact on overall performance. The need to 
fetch from an address out of sequence when a branch is taken not only disrupts orderly 
execution in a high-performance implementation (i.e., a pipelined implementation), but
also reduces the chance that a copy of the new instruction is already quickly accessible 
to the CPU (i.e., an implementation with one or more levels of cache memory). 

Eliminating one branch instruction from a loop will thus have a bigger effect than 
eliminating one of some other type of instruction. As an example, the top...out pro- 
gram fragment given in the preceding section can be reorganized as follows: 


top:    <instruction sequence A>        ; Always performed
        bge     Rx,10$
        <instruction sequence C>        ; Performed when Rx < 0
        ble     Ry,out
10$:    <instruction sequence B>        ; Performed when Rx >= 0
                                        ;  OR (Rx < 0 AND Ry > 0)
out:    <instruction sequence D>


Here we have not changed what the fragment actually does, but we have entirely elimi- 
nated the unconditional branch instruction. The overall logical intent of the routine may 
now be somewhat clearer to comprehend as well. The apparent conditional loop in the 
original was merely a tangle in logical flow that could in this case be streamlined by 
taking out the reverse-pointing branch. This is just common sense. If the actual intent is 
to go forward without looping, then the programmer should subject any branch instruc- 
tion that points backward to special scrutiny. 


Conditional Move Integer Instructions 


The fastest instructions in any computer architecture are those where all necessary oper- 
ands are already contained in registers. Making copies of data in registers is a very com- 
mon requirement in programming algorithms. Conditional branch instructions have 
many applications, but they can retard a program considerably when a sequence is bro- 
ken. Any anticipatory fetching of instructions by the hardware has to be abandoned, and 
fetching of a new instruction sequence must restart at the new address put into the pro- 
gram counter when a branch is taken. 

The Alpha architecture offers an imaginative approach to some of those needs and 
concerns with a set of conditional move integer instructions. These instructions are 
classified as integer operate instructions, either with three register operands or with two 
register operands and an 8-bit unsigned literal instead of a value in register Rb. 


        cmovxx  Ra,Rb,Rc        ; Rc <- contents of Rb if "Ra xx 0" is true
        cmovxx  Ra,lit,Rc       ; Rc <- lit if "Ra xx 0" is true
Like branch instructions, however, the contents of register Ra is tested. The particular
logical test “xx” is encoded in the function code field in the instruction word. If the test 
comes out true, then the contents of register Rb or the 8-bit unsigned literal is copied 
into register Rc. If the test comes out false, register Rc is not modified. 

Conditional move instructions share opcode 11 with the logical functions that we 
also take up later in this chapter. These instructions correspond to the function codes 
listed in Table 5.3. Because they incorporate a logical test without the potential slow- 
down associated with taking a PC-altering branch, these instructions are among the 
most powerful in the Alpha instruction set. 


Table 5.3 Alpha Conditional Move Integer Instructions 


                      Function
Mnemonic    Opcode    Code      Purpose
cmoveq      11        24        Conditional move if Ra = 0
cmovne      11        26        Conditional move if Ra <> 0
cmovlt      11        44        Conditional move if Ra < 0
cmovge      11        46        Conditional move if Ra >= 0
cmovle      11        64        Conditional move if Ra <= 0
cmovgt      11        66        Conditional move if Ra > 0
cmovlbs     11        14        Conditional move if low bit of Ra set
cmovlbc     11        16        Conditional move if low bit of Ra clear





Potential applications of these instructions abound. The cmovlbs instruction can
be used to expand a single bit of information into an 8-bit code from the literal or a 64- 
bit value from register Rb, for example. Three additional examples are the following: 


        cmoveq  Ra,Rb,Rc        is equivalent to    bne Ra,skip
                                                    mov Rb into Rc

        cmplt   R1,R2,R3        is equivalent to    R1 = MAX(R1,R2)
        cmovlbs R3,R2,R1

        cmovne  R31,R31,R31     is equivalent to    nop


Here the first cmoveq example shortens the program by one instruction, and avoids a 
conditional branch. The second example implements a common function ordinarily found 
only in high-level languages with a simple sequence of two machine-language
instructions. The cmplt instruction sets R3=1 if R1 is less than R2, or equivalently if the
condition R2>R1 is true; the cmovlbs instruction then copies R2 into R1 only if R2 was
indeed greater than R1, since otherwise the previous instruction would have set R3=0.
The third example is an instruction that does nothing, because R31 is permanently zero 
and thus cannot ever be unequal to zero. Almost every computer architecture has some 
form of no-operation, no-op, or nop instruction. The Alpha does not allocate a unique 
opcode for nop, since a cmov instruction (and actually some other special cases) can 
readily provide an equivalent. Thus nop is an Alpha pseudo-instruction. 
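
The register-level behavior of these examples can be modeled in C. The sketch below is only a model of the results, not of the timing: a C compiler is free to implement the ?: operator with a branch. The function names mirror the Alpha mnemonics, but the three-argument form (test value, source value, old destination value) and the name max2 are our own assumptions for illustration.

```c
#include <stdint.h>

/* Model of cmplt Ra,Rb,Rc: the result is 1 if a < b (signed), else 0. */
int64_t cmplt(int64_t a, int64_t b)
{
    return a < b ? 1 : 0;
}

/* Model of cmovlbs Ra,Rb,Rc: yields b if the low bit of a is set;
   otherwise the destination keeps its old value, passed here as c. */
int64_t cmovlbs(int64_t a, int64_t b, int64_t c)
{
    return (a & 1) ? b : c;
}

/* Model of cmoveq Ra,Rb,Rc: yields b if a is zero, else the old value c. */
int64_t cmoveq(int64_t a, int64_t b, int64_t c)
{
    return a == 0 ? b : c;
}

/* The two-instruction MAX sequence: cmplt R1,R2,R3 then cmovlbs R3,R2,R1. */
int64_t max2(int64_t r1, int64_t r2)
{
    int64_t r3 = cmplt(r1, r2);     /* R3 = 1 if R1 < R2          */
    return cmovlbs(r3, r2, r1);     /* R1 = R2 only when R3 = 1   */
}
```

Note that when the test fails, each model returns the old destination value unchanged, which corresponds to register Rc not being modified.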


MAXIMUM: Using Conditional Move Instructions 


From this point forward, virtually all of our example programs will contain loops. We 
will illustrate various techniques for the initialization and control of loops in those 
examples, and suggest other ways for your consideration through the exercises. 

FORTRAN and certain other high-level languages have an intrinsic function 
which can take as its input an indefinitely long list of numbers and then return as its 
computed result the algebraically largest value from that list. Certain other high-level 
languages, including some implementations of C, may lack such a function. Histori- 
cally, providing such desirable capabilities when they were missing from a high-level 
language was one of the common applications of assembly-language programming. 

Figure 5.2 presents a program MAXIMUM that returns in register R0 the maximum
value encountered during a scan of integers stored as quadwords in a list.
Although we do not directly know the number of data values that are stored starting at 
address num, we can use the label endnum as a limiting address. An address pointer 
(register R14) is stepped along by 8 each time through the loop until it reaches the limit 
(register R15). 





/* MAXIMUM   Find maximum value in list of integers (Unix) */

/* This program locates the algebraically largest number
   in the list of quadword values at 'num'. */

        .data
init:   .quad   0x8000000000000000      # Most negative number
num:    .quad   -1,+3,+5,-77,+99,0,-17
endnum:                                 # Next (unused) location
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #  mark the mandatory
main:                                   #  'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        .frame  $sp,0,$26,0             # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
first:  ldq     $0,init                 # R0 = Largest found so far
        lda     $14,num                 # R14 -> start of list
        lda     $15,endnum              # R15 -> end of list
again:  cmpult  $14,$15,$3              # When R14 = R15
        blbc    $3,done                 #  we're done
        ldq     $2,($14)                # R2 = Number to inspect
        cmplt   $0,$2,$1                # R1 = 1 if R0 < R2, else 0
        cmovlbs $1,$2,$0                # R0 = MAX(R2,R0) now
        lda     $14,8($14)              # R14 -> next number
        br      $31,again               #  to inspect
done:   mov     0,$0                    # Signal all is normal
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 5.2 MAXIMUM: An illustration of conditional move instructions 


The central part of the MAXIMUM program uses the two-element MAX “function” 
just shown above, i.e., a sequence of a compare instruction followed by a conditional 
move instruction. For the program to get going, we need to initialize the comparison value 
to the most negative number which can be represented in 64 bits since we anticipate that 
the comparisons will surely find something algebraically greater in the list. This must be 
done by loading that quadword value into a register (R0) from a memory location. The
mov-like instructions such as lda and ldah are limited to numbers that can be
expressed in fewer bits as source operands and thus would not serve us here.

When the program runs, each number in the list is successively brought into register
R2 and compared against the most recently found largest value. If the current number
in register R2 is larger, it replaces the previous comparison value in register R0. The
address pointer, register R14, is adjusted at each cycle through the loop using a load
address instruction that is functionally equivalent to adding 8 to the register as we did in
a previous example (Figure 5.1). Using lda instead of addq emphasizes that the value
in register R14 is being advanced because it represents an address, not because it
represents some kind of 8-step counter. While addq is not wrong, lda seems stylistically
preferable because it subtly contributes to comprehension and ease of maintenance of
the program.

The loop control is structured with the test at the top, much like a while-do con- 
struct in a high-level language. As emphasized previously, it is essential to use an 
unsigned comparison when the two quantities to be compared are addresses, as here in 
the instruction at again. If one were willing to assume that there would be at least one 
valid item of data at num, then the loop control could be restructured with the test at the
bottom, much like a do-until construct in a high-level language. In that way, there
would be only one branch instruction at the bottom leading back to the ldq instruction.
You should think about why that revision could speed up the program. 
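
The test-at-the-top structure of MAXIMUM can be compared with a C rendering. This is a sketch under our own assumptions, not a translation produced by any compiler: the function name maximum and its two pointer parameters (standing for the addresses num and endnum) are introduced here only for illustration.

```c
#include <stdint.h>

/* Hypothetical C rendering of the MAXIMUM loop, with the test at the top.
   num points at the first quadword and endnum just past the last one. */
int64_t maximum(const int64_t *num, const int64_t *endnum)
{
    int64_t largest = INT64_MIN;    /* most negative 64-bit number     */
    const int64_t *p = num;         /* R14 -> start of list            */

    while (p < endnum) {            /* address comparison, as cmpult   */
        int64_t x = *p;             /* R2 = number to inspect          */
        if (largest < x)            /* cmplt followed by cmovlbs       */
            largest = x;
        p += 1;                     /* advance the address by 8 bytes  */
    }
    return largest;
}
```

With an empty list (num equal to endnum) the loop body never executes and the initial most negative value is returned, which is the case that the test at the top protects.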

In a later chapter, we will show how to adapt routines such as MAXIMUM so that 
they conform to standard conventions for procedure calls and can then be linked to main 
programs written in high-level languages. 


Logical Functions 


At the heart of the central processing unit in every computer is the actual calculating 
component, which is usually called an arithmetic and logic unit (ALU) because it per- 
forms both arithmetic operations and logical functions according to the various opcodes 
encountered as each instruction is fetched. We are splitting our discussion of the large 
class of integer operate instructions for the Alpha across several chapters. We began in 
Chapter 4 with the purely arithmetic instructions. In the present chapter we have 
already discussed several groups of instructions (branches, compares, etc.) that involve 
logical comparisons based on numerical values of operands. Computers can also store 
logical data in their information units. In this section we will discuss a group of instruc- 
tions that implement logical functions that work directly with such data. 

In the nineteenth century, George Boole initiated a new branch of mathematics 
now known as Boolean algebra, in which variables take on only two values that repre- 
sent truth and falsity. Boolean algebra is a complete algebraic system with axioms, con- 
stants, variables, operators, theorems, and so forth. We would stray too far from the 
main purposes of this book if we were to present this algebra in any detail. We will 
instead assume of our readers an intuitive understanding of such logical concepts as 
AND, OR, and NOT. In computer storage, the bit value 1 is commonly assigned to truth 
and the bit value 0 to falsity. (Readers who know the C language will recall that 0 is
false and everything else is true when a multi-bit integer is used as a Boolean variable.) 

Suppose we have two independent Boolean (logical) variables, A and B. Since 
each of these can independently take on the values 0 and 1, there are altogether 4 ways 
of finding A and B: 00, 01, 10, and 11. Suppose we envision a function of these two 
variables, F(A,B). What will be the properties of this function? The function can be 
described by stating its value (0 or 1) for each of the four ways of finding its input vari- 
ables, A and B. 

The number of such possible Boolean functions, Fn, of two independent input
variables, A and B, is sixteen. These 16 possibilities represent the sixteen enumerations
of four 0 and 1 values, as depicted in Table 5.4 (adapted from Langholz). Two of these
functions (F0, F15) always have constant values irrespective of the values of variables A
and B. Four of these functions (F3, F5, F10, F12) reflect the unary operations that
“transfer” A or B (F3, F5) and that complement or reverse the logical state of B or A
(F10, F12).


Table 5.4 Boolean Functions of Two Variables 


 A  B | F0  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10 F11 F12 F13 F14 F15
 0  0 |  0   0   0   0   0   0   0   0   1   1   1   1   1   1   1   1
 0  1 |  0   0   0   0   1   1   1   1   0   0   0   0   1   1   1   1
 1  0 |  0   0   1   1   0   0   1   1   0   0   1   1   0   0   1   1
 1  1 |  0   1   0   1   0   1   0   1   0   1   0   1   0   1   0   1

F0   Constant 0
F1   AND          and instruction of the Alpha, A AND B
F2                bic instruction of the Alpha, A AND (NOT B)
F3   Transfer A
F4
F5   Transfer B
F6   XOR          xor instruction of the Alpha, A XOR B
F7   OR           bis instruction of the Alpha, A OR B
F8   NOR
F9   NXOR         eqv instruction of the Alpha, A XOR (NOT B)
F10  NOT B
F11               ornot instruction of the Alpha, A OR (NOT B)
F12  NOT A
F13
F14  NAND
F15  Constant 1





The remaining ten functions reflect various binary operations. Six of these ten are 
implemented as binary logical instructions in the Alpha architecture, as shown in Table 
5.4. These instructions perform logical functions on 64 bits in parallel, e.g., A AND B 
consists in detail of A0 AND B0, A1 AND B1, ..., A63 AND B63. At each bit position,
the resultant bit is 1 if and only if the source bits are 1 in both A and B. 
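
The numbering of the sixteen functions can be captured in a few lines of C. The sketch assumes the numbering implied by the names in Table 5.4 (F1 is AND, F7 is OR, F3 transfers A): reading the four values of a function from the row A,B = 0,0 to the row 1,1 gives the binary form of its index. The function name boolean_fn is our own.

```c
#include <stdint.h>

/* Value of Boolean function number n (0..15) for inputs a and b, each
   0 or 1. The value for row A,B = 0,0 is the high-order bit of the
   four-bit number n, and the value for row 1,1 is the low-order bit. */
int boolean_fn(int n, int a, int b)
{
    int row = 2 * a + b;            /* 0..3: which row of the truth table */
    return (n >> (3 - row)) & 1;    /* pick that row's bit out of n       */
}
```

For instance, boolean_fn(2, a, b) reproduces A AND (NOT B), the function implemented by bic, and boolean_fn(11, a, b) reproduces A OR (NOT B), the function implemented by ornot.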

The Alpha architecture perpetuates the names for three logical functions (bis, 
bic, xor) from its predecessors, the PDP-11 and the VAX, and adds three more logical
functions (and, eqv, ornot). The older architectures also had a bit test instruction,
bit, which was actually an and that did not store its full N-bit result but instead
set a single-bit summary code to 1 if the 16-bit result (PDP-11) or 32-bit result (VAX) 
contained any 1 bits at all. The Alpha does not have such a bit instruction. 

The logical functions of the Alpha operate with 64-bit Boolean data in three regis- 
ters, or, like other groups of instructions in the integer operate class, with one value as 
an 8-bit literal instead of a full 64-bit value in register Rb: 


        mnemonic  Ra,Rb,Rc      ; Rc <- Ra contents OP Rb contents
        mnemonic  Ra,lit,Rc     ; Rc <- Ra contents OP lit


The value in register Ra is best regarded as the variable or unknown value being manip- 
ulated. The value in register Rb will often be a known quantity, a reference, a bench- 
mark, or a logical “mask” (explained below). 

Logical functions share opcode 11 with the conditional move instructions. The
logical functions correspond to the particular function codes listed in
Table 5.5. 


Table 5.5 Alpha Logical Functions 


                      Function
Mnemonic    Opcode    Code      Purpose
and         11        00        Logical AND (logical product)
bis         11        20        Logical OR (logical sum)
xor         11        40        Logical XOR (logical difference)
bic         11        08        Logical AND with complement
ornot       11        28        Logical OR with complement
eqv         11        48        Logical NXOR (logical equivalence)





The bic, ornot, and eqv functions first complement (or reverse all the bits of)
the value in register Rb or the literal, and then perform a related operation (and, or,
xor) with that modified value and the value in register Ra.


        bic     Ra,Rb,Rc        ; Rc <- value in Ra AND
                                ;    (NOT lit or value in Rb)
        ornot   Ra,Rb,Rc        ; Rc <- value in Ra OR
                                ;    (NOT lit or value in Rb)
        eqv     Ra,Rb,Rc        ; Rc <- value in Ra XOR
                                ;    (NOT lit or value in Rb)
When an 8-bit literal is involved, the high-order 56 bits of the complemented second
operand will become 1 in this process. These functions do not negate the result of the
related logical function. That is, the Alpha ornot function is not equivalent to the
more common NOR Boolean function, and the Alpha bic logical function is not
equivalent to the more common NAND Boolean function (Table 5.4). Among all the logical
functions in the Alpha instruction set, only xor and eqv are totally complementary.
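
The distinction between complementing the second operand and complementing the result is easy to check with a short C model. The function names alpha_bic, alpha_ornot, and alpha_eqv are our own; each takes the Ra value and the Rb value (or the zero-extended literal) and returns what would be written to Rc.

```c
#include <stdint.h>

/* Rc <- Ra AND (NOT Rb) */
uint64_t alpha_bic(uint64_t ra, uint64_t rb)
{
    return ra & ~rb;
}

/* Rc <- Ra OR (NOT Rb) */
uint64_t alpha_ornot(uint64_t ra, uint64_t rb)
{
    return ra | ~rb;
}

/* Rc <- Ra XOR (NOT Rb) */
uint64_t alpha_eqv(uint64_t ra, uint64_t rb)
{
    return ra ^ ~rb;
}
```

With a zero Ra value and the literal 0x0F, alpha_ornot returns 0xFFFFFFFFFFFFFFF0, which shows the high-order bits of the complemented literal becoming 1. Comparing alpha_ornot(a, b) with ~(a | b) for a few values confirms that ornot is not NOR, whereas alpha_eqv(a, b) always equals ~(a ^ b).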


Use of the logical functions sometimes involves the concept of applying a mask to 
logical data. The bis (bit set) instruction, for instance, has the effect of setting bits into 
the value in register Ra at the positions where the value in register Rb or the literal has 1 
bits, and then putting that result into register Rc. The bic (bit clear) instruction, by 
contrast, has the effect of clearing bits from the value in register Ra at the positions 
where the value in register Rb or the literal has 1 bits, and then putting that result into 
register Rc. The xor (exclusive or) instruction has the effect of toggling the bits of the 
value in register Ra at the positions where the value in register Rb or the literal has 1 
bits, and then putting that result into register Rc. Assume original binary contents 
10101010 in register Ra and the binary literal 00110011 as the mask: 


Ra: 10101010 Ra: 10101010 Ra: 10101010 
lit: 00110011 lit: 00110011 lit: 00110011 
bis: 10111011 bic: 10001000 xor: 10011001 


With bis (logical sum), the 1 bits in the mask replace 0 bits from the value in register 
Ra in forming the result. With bic, the 1 bits in the mask prevent 1 bits from the value 
in register Ra from being copied into the result and put 0 bits in those places. With xor 
(logical difference), the result shows 1 bits at positions where the value in register Ra 
differs from the value in the mask. 

One very common application of bis and bic instructions is to convert between 
the upper-case and lower-case ASCII alphabetic character codes because these differ 
consistently by just one bit, 00100000 (binary), 20 (hexadecimal), or 32 (decimal) (see Table 2.3). Assume that the
code for an alphabetic character is in register Ra: 


        bis     Ra,32,Rc        ; Force character to lower case
        bic     Ra,32,Rc        ; Force character to upper case


Recall that the default radix for constants is decimal for the assemblers. We could also 
specify the literal as 0x20 (Unix) or ^x20 (OpenVMS). What would a similar xor
instruction do to the code for an alphabetic character? 
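
The two case-forcing instructions can be modeled in C as follows. This is a minimal sketch that assumes, as the text does, that the argument really is the code for an alphabetic character; the names CASE_BIT, to_lower_bis, and to_upper_bic are our own.

```c
#include <stdint.h>

#define CASE_BIT 0x20u          /* the single differing bit, decimal 32 */

/* Model of bis Ra,32,Rc: set the case bit, forcing lower case. */
uint64_t to_lower_bis(uint64_t ch)
{
    return ch | CASE_BIT;
}

/* Model of bic Ra,32,Rc: clear the case bit, forcing upper case. */
uint64_t to_upper_bic(uint64_t ch)
{
    return ch & ~(uint64_t)CASE_BIT;
}
```

Both operations are idempotent: applying to_lower_bis to a character that is already lower case leaves it unchanged, and likewise for to_upper_bic with an upper-case character.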
The Unix assembler accepts or as a synonym for the bis instruction, but the
OpenVMS assembler does not. In order to emphasize similarities rather than differ- 
ences between the assemblers, we will use bis as the opcode for logical OR. 


Shift Instructions 


Computer architectures usually provide instructions that shift a bit pattern to the left or 
right. The first implementations of the PDP-11 offered only shifts by a single bit posi- 
tion, while many 32-bit computers including the VAX offer shifts by multiple bit posi- 
tions. The Alpha offers left logical, right logical, and right arithmetic shift instructions: 


        sxl     Ra,Rb,Rc        ; Rc <- contents of Ra shifted Rb bits
        sxl     Ra,lit,Rc       ; Rc <- contents of Ra shifted lit bits
        sra     Ra,Rb,Rc        ; Rc <- contents of Ra shifted Rb bits
        sra     Ra,lit,Rc       ; Rc <- contents of Ra shifted lit bits


where x = l for a shift left logical (sll) and x = r for a shift right logical (srl). The
logical shift operations fill in vacated bit positions with 0 bits. The shift right arithmetic 
operation (sra) fills in vacated positions with either 0 or 1 bits that will replicate the 
sign bit of the original value (bit 63 of Ra). The amount of these shifts can be 0-63 bits 
in the literal or in register Rb. A design choice was made here to specify the direction 
for the logical shift using separate opcodes, rather than to specify the direction by mak- 
ing the shift argument a signed quantity. Some other architectures adopt the latter 
choice instead. 

The shift instructions share opcode 12 with the byte-manipulation instructions, 
which will be taken up in Chapter 6. These instructions correspond to the function 
codes listed in Table 5.6. 


Table 5.6 Alpha Integer Shift Instructions 


                      Function
Mnemonic    Opcode    Code      Purpose
srl         12        34        Shift right logical
sll         12        39        Shift left logical
sra         12        3C        Shift right arithmetic





The shift right arithmetic instruction (sra) divides a signed integer value by 2 for 
each bit position shifted. The remainder of this simulated division operation is lost, i.e., 
3/2 and 2/2 both yield the result 1 for this integer division. Thought exercises for the 
reader: What happens for -3/2, -2/2, and -1/2 with the sra instruction? What type of 
Alpha shift instruction should be used to divide an unsigned integer by 2? 
There is no shift left arithmetic instruction on the Alpha. The shift left logical
instruction can be used when the intent is to multiply by powers of 2; however, overflow 
can potentially occur (i.e., result has a change of sign) and is not checked by the hard- 
ware in any way for this instruction. The programmer must plan an algorithmic design 
robust enough to be immune from this situation. 

Caution: In some architectures, a shift instruction may have special properties 
such as retaining the out-shifted bit in a dedicated storage location, inserting such a bit 
instead of a zero, or rotating the bit pattern circularly. In the Alpha architecture, how- 
ever, the out-shifted bits are simply lost. 
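
The three shifts can be modeled portably in C, which is useful for experimenting with the fill behavior. The function names are our own, and the sketch simply reduces the shift count to the range 0-63 by masking. Because right-shifting a negative signed value is implementation-defined in C, the arithmetic shift is built here from an unsigned shift plus an explicit sign fill.

```c
#include <stdint.h>

/* Shift left logical: vacated low-order bits are filled with 0. */
uint64_t alpha_sll(uint64_t ra, unsigned count)
{
    return ra << (count & 63);
}

/* Shift right logical: vacated high-order bits are filled with 0. */
uint64_t alpha_srl(uint64_t ra, unsigned count)
{
    return ra >> (count & 63);
}

/* Shift right arithmetic: vacated high-order bits replicate bit 63. */
uint64_t alpha_sra(uint64_t ra, unsigned count)
{
    unsigned n = count & 63;
    uint64_t shifted = ra >> n;                 /* zero-filled so far     */
    if (n != 0 && (ra >> 63) != 0)              /* original was negative  */
        shifted |= ~UINT64_C(0) << (64 - n);    /* set the n vacated bits */
    return shifted;
}
```

In all three models the bits shifted out of the 64-bit value are simply discarded, as on the Alpha.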


RADIX2: Using Logical and Shift Instructions 


We will now apply one of the logical functions and one of the shift instructions in a fur- 
ther adaptation of the RADIX program (Figure 5.3). For this new version, we will 
encode the input digits as ASCII characters and store them in a single quadword. Since 
the load instructions of the Alpha can only access either longword- or quadword-length 
information units, not individual bytes, we will use logical masking and shifting tech- 
niques in a program loop to isolate the digits. Each digit will require conversion from 
ASCII representation to the raw numeric equivalent. 



/* RADIX2   Number Conversions (Unix) */

/* This program will convert a positive number, expressed in
   radix RAD digits stored as characters beginning at D2, into a
   value in R0. */

RAD = 10                                # Define a symbolic constant
        .data
D2:     .byte   "9","1","4",0,0,0,0,0   # Characters for "914"
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #  mark the mandatory
main:                                   #  'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        .frame  $sp,0,$26,0             # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
first:  ldq     $2,D2                   # Get 8 characters
        cmoveq  $31,$31,$0              # R0 = 0 (holds the number)
more:   cmoveq  $31,$2,$1               # Copy of (shifted) bytes
        and     $1,0xFF,$1              # Just one character
        beq     $1,done                 # Null code marks end
        bic     $1,0x30,$1              # Convert from ASCII to value
        mulq    $0,RAD,$0               # Multiply previous by 10
        addq    $1,$0,$0                # Add new digit value
        srl     $2,8,$2                 # Discard current byte
        br      $31,more                # Go look for more
done:   mov     0,$0                    # Signal all is normal
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 5.3 RADIX2: An illustration of logical and shift instructions 


We draw your attention to several features of this program that not only illustrate 
the different instruction types introduced in this chapter, but also represent techniques 
that will recur in future examples in this book. First, cmov instructions are used in two 
ways, for initializing register R0 to zero and for copying a value from one register to
another (at label more). We could have used the pseudo-operation mov, but the assem- 
bler would convert that into something very like these actual cmov instructions. Sec- 
ond, an and instruction with a one-byte literal mask of all 1 bits is used to isolate the 
low-order byte of a quadword. Third, we are using a common convention of marking 
the end of a stored string with a null byte (ASCII code 0, not the printing character 
zero); this special code triggers an exit from the program loop when there are no more 
valid ASCII numeric characters to process. Fourth, we use a bic instruction with the 
appropriate mask 0x30 to clear away the two specific bits that occur in the ASCII 
codes for numeric characters (see Table 2.3). Finally, we use a logical shift instruction 
to discard the byte just processed; the next byte then slides into the lowest-byte position 
within the quadword in register R2. 

If you are able to study this example at the computer, you should monitor the
contents of registers R0 through R2. Watch the values in those registers change as you step
through the loop until it exits to the line at done. You will see the logical masking and
shifting operations very clearly. Is the final numeric value in register R0 correct?

As you study our examples and continue developing your own short routines and 
programs, you will gain confidence devising good organizations for loops. In this exam- 
ple, the test for the exit seems to be in the middle, not at the top (as expected for 
WHILE condition ... DO) and not at the bottom (as expected for DO ... UNTIL condi- 
tion). This is partly an illusion because it takes three Alpha instructions at the label 
more to set up the test (cmoveq, and, beq). If we were to construe those three 
instructions as a higher level abstract construct, then we would indeed perceive this
program as containing a form of WHILE condition ... DO loop control. The unconditional
branch instruction marks the lower extremity of the loop. If we were programming in
something like Pascal here, the bic, mulq, addq, and srl instructions would constitute
the body of a loop inside BEGIN ... END bracketing. Of course Pascal might not
naturally produce data alterations as part of the condition evaluation, whereas here our 
three Alpha instructions at the label more also set up register R1 with appropriate data 
for the current cycle. 
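
The masking and shifting loop can also be written out in C, which makes the exit test in the middle of the loop explicit. This is a sketch under our own assumptions: the function name radix2 is hypothetical, and the argument is the quadword as it would appear in register R2 after the ldq, with the first stored character in the low-order byte.

```c
#include <stdint.h>

#define RAD 10                              /* radix of the stored digits  */

/* Hypothetical C rendering of the RADIX2 loop. */
uint64_t radix2(uint64_t packed)
{
    uint64_t value = 0;                     /* accumulated number          */
    for (;;) {
        uint64_t ch = packed & 0xFF;        /* isolate the low-order byte  */
        if (ch == 0)                        /* null code marks the end     */
            break;
        ch &= ~(uint64_t)0x30;              /* clear the ASCII digit bits  */
        value = value * RAD + ch;           /* multiply, then add digit    */
        packed >>= 8;                       /* discard the current byte    */
    }
    return value;
}
```

Even if all eight bytes hold digits, the loop terminates: after eight logical shifts the quadword is zero, so the ninth test sees a null byte.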


Integer Division Revisited 


When we introduced integer arithmetic instructions in Chapter 4, we pointed out that 
the Alpha architecture lacks an integer divide instruction. As Bhandarkar has noted, 
several methods can be used to implement integer division: 


• use of floating-point hardware to convert from integer into floating-point representation,
  to divide, and then to convert back. The precision of the floating-point
  representation is only 53 bits, however, not a full 64 bits (see Chapter 2);

• iterative testing and subtraction to implement algorithms analogous to decimal
  long division; and

• multiplication by the reciprocal using the umulh instruction, and then a shift
  instruction to normalize the result.


This last method, using a reciprocal, is especially applicable for division by a constant 
that is fully known at the time the program is assembled. 

Consider the very common need to divide a binary integer by 10 as part of an 
algorithm for formatting decimal numbers before printing them. The reciprocal of 10 is 
the repeating fraction 0.0001100110011... (binary). In order to preserve maximum pre- 
cision, we can instead use the normalized 64-bit fraction 0.CCCCCCCCCCCCCCCD 
(hexadecimal representation of 0.8) for the multiplication, with a subsequent division 
by 8. The final hexadecimal digit of the fraction is D rather than C in order to ensure 
normal rounding, since shifting as a means of division will otherwise truncate rather 
than round. According to standard axioms of algebra, this division by 8 can be deferred 
until after obtaining an extended-precision product formed through multiplication of 
the source number by the normalized fraction. That is, we can use the simple mathematical
identities x / 10 = 0.1 x = 0.8 x / 8.

Let’s explain this method in more detail. Since one-tenth is less than one-eighth 
but greater than one-sixteenth, the representation of 0.1 (decimal) as a binary fraction 
begins with three zero bits before the first one bit (that is, no one-half, no one-quarter,
no one-eighth, one one-sixteenth, etc.). The full binary representation for one-tenth is
then found, by a continuation of this process, to contain a repeating bit pattern
...1100... which we know corresponds to the hexadecimal character C. With binary
numbers and fractions, each shift to the left by one bit position corresponds to a multiplication
of the value by a factor of two, and each shift to the right by one bit position
corresponds to a division of the value by a factor of two. If we shift the bit pattern representing
one-tenth three bit positions to the left, so that those first three zeros are gone,
the value becomes 2³ = 8 times greater; thus the resulting 0.CCC... hexadecimal pattern
represents the value eight tenths.
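
The bit patterns described above can be checked with a few lines of Python. This is a sketch for verification only; the helper name binary_fraction_bits and the variable dot8 are our own, not part of any Alpha software.

```python
def binary_fraction_bits(num, den, count):
    """Return the first `count` bits after the binary point of num/den."""
    bits = []
    for _ in range(count):
        num *= 2                  # shift left by one bit position
        bits.append(num // den)   # the integer part is the next bit
        num %= den                # keep only the fractional part
    return "".join(str(b) for b in bits)

print(binary_fraction_bits(1, 10, 16))   # one-tenth:    0001100110011001
print(binary_fraction_bits(8, 10, 16))   # eight-tenths: 1100110011001100

# Eight-tenths scaled to a 64-bit fraction and rounded up in the last place
dot8 = -(-8 * 2**64 // 10)
print(f"{dot8:016X}")                    # CCCCCCCCCCCCCCCD
```

The rounding step is what turns the final hexadecimal digit from C into D.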

When we wish to divide a value x by ten, we may first multiply that value by 0.8 
to form an intermediate product 0.8 x of highest precision. Then we may divide the 
intermediate product 0.8 x by 8 to obtain a result equivalent to x / 10. Division by 8 is 
easily accomplished on any computer that supports an arithmetic shift to the right by 
three bit positions. 

Now let’s express this algorithm using Alpha instructions. At first, we assume that 
the source number in register Rx is already known to be positive. We elect to use the 
umulh instruction, which handles only unsigned quantities directly.


(data section)

DOT8:   .quad   0xCCCCCCCCCCCCCCCD      # 0.8 (finite approx.)

(code fragment)

        ldq     $2,DOT8                 # Get reciprocal factor
        umulh   $x,$2,$1                # R1 = high 64 bits
        srl     $1,3,$1                 # Divide by 8


We now have the desired quotient in register R1. The original source number is still in 
register Rx. If we are doing a modulo calculation and need to know the remainder, we 
can go on to compute it from Rx — 10*R1. 
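
If no Alpha system is at hand, the same arithmetic can be tried in Python, whose integers are not limited to 64 bits. The following is a minimal sketch under that assumption; the function names umulh, div10, and rem10 are ours and merely imitate the instructions of the fragment above.

```python
MASK64 = (1 << 64) - 1
DOT8 = 0xCCCCCCCCCCCCCCCD            # 0.8 as a 64-bit binary fraction

def umulh(a, b):
    """High-order 64 bits of the unsigned 128-bit product, as umulh delivers."""
    return ((a & MASK64) * (b & MASK64)) >> 64

def div10(rx):
    r1 = umulh(rx, DOT8)             # R1 = high 64 bits of 0.8 * Rx
    r1 >>= 3                         # srl: divide by 8
    return r1

def rem10(rx):
    return rx - 10 * div10(rx)       # remainder from Rx - 10*R1

for x in (7, 256, 0x123456, MASK64):
    print(x, div10(x), rem10(x))
```

The last test value is the largest unsigned quadword, for which the quotient still agrees with ordinary integer division.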

Here is an easy walk-through example. What is the result of integer division of 
256 by 10 using this program fragment? Since 2⁸ = 256, the multiplication of DOT8 by
256 will result in a 128-bit product based on a shift to the left by 8 bits, or 2 hexadecimal
digits, where


low-order 64 bits are: CCCCCCCCCCCCCD00 (hexadecimal) 
high-order 64 bits are: 00000000000000CC (hexadecimal) 


Remember that a mulq instruction would retain the low-order 64 bits, while a umulh
instruction would retain the high-order 64 bits (in register R1 in our illustration).
Continuing, we now divide the high-order portion by 8,


        CC₁₆ = 1100 1100₂
               0001 1001₂ = 19₁₆ = 25₁₀


and obtain the proper result of 25. (Integer division discards any remainder.) 
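
The walk-through can be reproduced directly, since Python will hold the full 128-bit product. The variable names below are our own choices for this sketch.

```python
DOT8 = 0xCCCCCCCCCCCCCCCD

product = DOT8 * 256                  # full 128-bit product
low = product & ((1 << 64) - 1)       # what mulq would retain
high = product >> 64                  # what umulh would retain

print(f"{low:016X}")                  # CCCCCCCCCCCCCD00
print(f"{high:016X}")                 # 00000000000000CC
print(high >> 3)                      # 25
```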

How should we proceed if the source number may be either negative or positive? 
To provide for more general situations, we need to work with the magnitude of the 
number, but preserve knowledge of its sign so that it can be reintroduced at the end. One 
way to do this using DOT8 as before would be as follows:


        subq    $31,$x,$1               # -Rx
        cmovge  $x,$x,$1                # Use original Rx?
        ldq     $2,DOT8                 # Get reciprocal factor
        umulh   $1,$2,$1                # R1 = high 64 bits
        srl     $1,3,$0                 # Divide by 8
        subq    $31,$0,$1               # -R0
        cmovge  $x,$0,$1                # Assure correct sign


Remember that subtracting a quantity from register R31 is equivalent to negation of a 
two’s complement number. The two conditional move instructions accomplish the 
potential interchanges without the need for branch-arounds. 
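
The effect of the two conditional moves can be followed in a small Python model. This is a sketch under stated assumptions: the names to_signed and div10_signed are ours, registers are modeled as 64-bit patterns, and the multiplication is applied to the magnitude held in R1.

```python
MASK64 = (1 << 64) - 1
DOT8 = 0xCCCCCCCCCCCCCCCD

def to_signed(v):
    """Interpret a 64-bit pattern as a two's complement number."""
    v &= MASK64
    return v - (1 << 64) if v >> 63 else v

def div10_signed(x):
    rx = x & MASK64                  # bit pattern of the source number
    r1 = (0 - rx) & MASK64           # subq $31,$x,$1   : R1 = -Rx
    if to_signed(rx) >= 0:           # cmovge $x,$x,$1  : keep Rx if not negative
        r1 = rx
    r1 = (r1 * DOT8) >> 64           # umulh on the magnitude
    r0 = r1 >> 3                     # srl: divide by 8
    r1 = (0 - r0) & MASK64           # subq $31,$0,$1   : R1 = -R0
    if to_signed(rx) >= 0:           # cmovge $x,$0,$1  : keep R0 if Rx not negative
        r1 = r0
    return to_signed(r1)

print(div10_signed(256), div10_signed(-256), div10_signed(-7))
```

Because the division is performed on the magnitude, the quotient is truncated toward zero for negative source numbers.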


DECNUM: Converting an Integer to ASCII Decimal Format 


When we discussed number systems briefly in Chapter 1, we did not dwell on detailed 
methods for conversion of numbers from one radix to another. The formatting of 
numeric data as strings of ASCII characters for printing, usually in decimal radix, is 
especially important. In this section, we will develop one method for decimal number 
formatting that illustrates many of the instructions and techniques presented in this 
chapter, although it is limited to unsigned 8-digit numbers. In a later chapter we will 
develop a more general approach. 

If we start with a numeric value such as FE₁₆, how do we systematically convert
this to 254₁₀? If we divide the value by 10, we get a quotient of 25 and a remainder of 4.
If we divide that first quotient by 10, we get a second quotient of 2 and a second
remainder of 5. We can repeat this process until we get a quotient of 0 and a final
remainder of 2. The sequence of computed remainders (4, 5, 2 in this example) comprises
the digits of the base-10 result we want, but in reverse order from the way we
would print them (2, 5, 4 in this example). In order to put those characters into the 
proper order, we could use some form of last-in first-out (LIFO) stack. We briefly 
touched on the type of stack addressing applicable to certain other computer architec- 
tures in Chapter 4, and we will discuss stacks based on memory addressing in the Alpha 
in a later chapter. For now, we will use the 8-byte capacity of a register in conjunction 
with the capabilities of the logical shift instructions for an 8-digit stack.

A brief program using these ideas is presented in Figure 5.4. As you study this
program, preferably at the computer using the debugger, you should realize its limitation
to positive values that would convert to no more than 8 printable decimal digits.
There are no tests that enforce this limit in our simple program. If you monitor registers
R0 through R5, you will be able to follow the progress of the algorithm. Each digit of
the result is produced, converted into an ASCII code, and put into register R0 using a
bis instruction (logical OR).


/*  DECNUM      Convert integer to ASCII (Unix)  */


/*  This program converts a positive number from N1
    into a string of ASCII-encoded decimal digits at A1.  */


        .data
DOT8:   .quad   0xCCCCCCCCCCCCCCCD      # 0.8 (finite approx.)
N1:     .quad   0x123456                # Number to convert
        .comm   A1,8                    # Space for result
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #   mark the mandatory
main:                                   #   'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        .frame  $sp,0,$26,0             # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
first:  ldq     $2,DOT8                 # Get reciprocal factor
        ldq     $3,N1                   # Get number to convert
new:    cmoveq  $31,$31,$0              # R0 = 0 to start
10:     umulh   $3,$2,$1                # R1 = high 64 bits
        srl     $1,3,$1                 # R1 = quotient = R3/10
        mulq    $1,10,$4                # R4 = 10*(R3/10)
        subq    $3,$4,$5                # R5 = remainder
        bis     $5,0x30,$5              # Make into ASCII char
        bis     $5,$0,$0                # Store the character
        beq     $1,store                # Done if 0 quotient
        sll     $0,8,$0                 # Shift left by one byte
        cmoveq  $31,$1,$3               # Move quotient to R3
        br      10b                     # There is more to do
store:  stq     $0,A1                   # ASCII result
done:   mov     0,$0                    # Signal all is normal
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 5.4 DECNUM: An illustration of integer division by 10

Give special thought to the position of the beq instruction that implements the
exit test from the program loop. Why is it precisely there, and not anywhere else? We 
recommend the following debugging commands (system response not shown): 


Unix                        OpenVMS

(dbx) store:/2i
(dbx) /done:
(dbx) stop at 37            DBX> set break done
(dbx) run                   DBX> go
(dbx) px $r0                DBX> examine/hexadecimal r0
(dbx) $r28/s                DBX> examine/ascii A1:A1+7


The Unix assembler generates a two-instruction sequence at store setting up scratch 
register R28 as a pointer for accessing the information unit where the result is stored. 
The printed string should be 1193046 (decimal equivalent of hexadecimal 123456). 
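
The behavior of the DECNUM loop, including the use of one register as an 8-digit stack, can be imitated in Python. This is a sketch for study rather than a translation of the assembler output. The function name decnum is ours, and the final conversion to bytes assumes the little-endian byte order in which the quadword is stored.

```python
MASK64 = (1 << 64) - 1
DOT8 = 0xCCCCCCCCCCCCCCCD

def decnum(n):
    """Return the 8 bytes that the stq instruction would place at A1."""
    r3 = n & MASK64                      # number to convert
    r0 = 0                               # R0 = 0 to start (label new)
    while True:
        r1 = ((r3 * DOT8) >> 64) >> 3    # umulh, srl: quotient R3/10
        r5 = r3 - 10 * r1                # mulq, subq: remainder
        r5 |= 0x30                       # bis: make into ASCII character
        r0 |= r5                         # bis: put the character into R0
        if r1 == 0:                      # beq: done if quotient is zero
            break
        r0 = (r0 << 8) & MASK64          # sll: make room for the next digit
        r3 = r1                          # move quotient to R3
    return r0.to_bytes(8, "little")

print(decnum(0x123456))                  # b'1193046\x00'
```

The exit test precedes the shift, so the last digit produced remains in the lowest byte and is therefore the first character in memory.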

A seemingly superfluous label new has been carefully positioned just above the 
start of the program loop. Several of our future examples will be similarly designed so 
that truly global initializations (e.g., loading the fraction 0.8 into register R2 in DECNUM)
are set apart from case-specific (re)initializations, such as ensuring that a register
which accumulates a result is properly zeroed (e.g., R0 in DECNUM at label new).
Those details of organization then permit a program to be adapted using a main outer 
loop. Each time through, the program can prompt a user or operator for case-specific 
input, process that case, produce corresponding output, and return to the prompt repeti- 
tively until some agreed-upon exit condition occurs. 


Summary 


This chapter has discussed the integer branch, compare, conditional move, logical, and 
shift instructions of the Alpha architecture. Knowledge of how branch instructions work 
has opened up a wider universe of still-simple algorithms for example programs and 
exercises. Logical flow in a program at the assembly language level has been contrasted 
with control structures typically found in high-level languages. The concept of bit 
masks has been illustrated by the distinction between the value and the ASCII character 
representation of a decimal digit. The continued use of the debugger has been encour- 
aged in order to visualize the successive actions taking place in some simple algorithms 
that will serve as building blocks for later illustrations and exercises.


EXERCISES


5.1 Explain how the Alpha branch displacement leads to a branching range of one mil- 
lion instructions in both directions. 


5.2 Explain the execution of an unconditional branch instruction whose displacement 
value is -1. 


5.3 Explain how to use a branch instruction to bring the Alpha program counter into 
view. Test your idea using the debugger. 


5.4 Explain whether cmp and cmpu instructions give identical results if the two source 
operands both “look negative.” 


5.5 Suggest why the Alpha architecture does not have: (a) a cmpne instruction; (b) a
cmpueq instruction.


5.6 Write a program for vector subtraction, that is, one that subtracts corresponding components
of two lists of numbers. Store the result of V1 - V2 in a third vector V3.


5.7 (OpenVMS only) Explain why the lda instruction which sets r14 as a pointer to
vector V1 at first in the DOT_N program must come ahead of the lda instruction
which sets r15 as a pointer to vector V2.


5.8 Write an Alpha instruction sequence equivalent to MIN(R3,R4), depositing the mini- 
mum value in register RO. 


5.9 What does the instruction cmoveq R31,R31,R31 do? 
5.10 What single instruction will copy data from register R5 to R7? 


5.11 Modify the MAXIMUM program so that it reports both the greatest number found 
and the index position at which that value occurs in the array NUM().


5.12 Write a program that finds the least value in an array of signed integers stored as
longwords. For loop control, use a method based on address pointer comparison. 


5.13 Suppose that registers Ra and Rb contain hexadecimal values 12345678 and
9ABCDEF0, respectively. What would be the results computed and stored in register
Rc by each of the Alpha logical functions in Table 5.5?


5.14 What is the net result of the following instruction sequence? 


        or      R1,R2,R2
        xor     R2,R1,R1
        xor     R1,R2,R2


5.15 Devise examples to show that bic R2,R3,R4 does not produce R4 = R2 NAND
R3 and that ornot R2,R3,R4 does not produce R4 = R2 NOR R3, in general.


5.16 The PDP-11 and VAX architectures contained a com instruction that produced a log- 
ical complement, i.e., turned all 0 bits to 1 and all 1 bits to 0. Explain how to do this 
with a particular Alpha instruction. 


5.17 Find out how the Alpha assembler that you are using responds to an attempt to use a 
literal greater than 63 with the shift instructions. Similarly find out how the Alpha 
CPU responds to an attempt to use a value greater than 63 in register Rb with the 
shift instructions. Describe your findings. 


5.18 Refer to the discussion of division by 10 in the text. Write out and explain the addi- 
tional instruction(s) that will compute the remainder in register RO, assuming posi- 
tive operands. 


5.19 Modify the DECNUM program to handle negative numbers of up to 6 digits in their 
base 10 representation. Can you make it print a leading minus sign if and only if a 
number is negative? 


5.20 Design a routine for computing the polynomial N³ + N based on the finite differences
algorithm for cubes, following these revised specifications:


a. Provide a quadword for M, where M is the maximum number of instances to be 
computed. 


b. Apart from M, the program must use dynamic instead of assembly-time (re)ini- 
tialization. 


c. Let label top denote the first executable instruction beyond truly global initial- 
izations. 


d. Use a central loop for the repetitive part of the algorithm. Think carefully how 
many times this loop should run in relation to M. 


e. Store the results in successive quadword memory locations beginning at label POLY. 


f. Design the program to take up the least possible storage for executable instructions.

Test the program using the debugger with M=8. Inspect the contents of the reserved
storage locations to demonstrate the validity of your program.


5.21 A table of consecutive memory locations is called PTRTBL. The length of the table 
(i.e., the number of slots) is unknown, but the contents of each entry in the table is 
known in advance to be: 


• an address pointer to longword-length data (an even, non-zero number);

• zero, marking an empty slot; or

• the value -1, marking the end of the table.

Write a routine that passes through PTRTBL only once and counts: 

a. the number of valid address pointers; and 

b. the total number of occupied and empty slots, exclusive of the -1 marker. 


The routine should copy these tallies into quadword locations labeled ACTIVE and 
TOTAL just before it finishes. Use the following data for the first actual run: 


PTRTBL: .quad 0, 4000, 0, 0, 0, 0x8ACE642012345678, 5276, -1

Conduct a second run using about a dozen entries of your own.


5.22 (Strongly recommended) Incorporate all of the additional Alpha instructions from 
this chapter into your personal summary chart(s). 





CHAPTER 6 


Working with Bytes on 
the Alpha 


Digital computers began as instruments for automating
calculations in military applications and in scientific research. Information units in
these early computers tended to be dozens of bits in width in order to support algo- 
rithms that needed high numerical precision. We may imagine that when people used 
these systems to compute solutions to practical engineering problems, in the form of 
tables of numbers, just getting those desired results would have seemed to be a hard- 
won but satisfying achievement. Perhaps little thought was given to having the com- 
puter also print descriptive headings for those columns of numbers, since a typewriter 
could adequately perform that small additional task. 


Very early in the twentieth century, long before computers were invented, tabulat- 
ing machines came into use for such applications as processing demographic data from 
the census and printing address labels or payroll checks. Those electromechanical 
devices could sort stiff paper cards according to patterns of holes that represented print- 
able characters, i.e., one character for each of up to 80 columns on a card. Because of 
the rather large size of the holes, there were only 12 rows of holes in each column;
therefore, such columns of holes could encode only fairly small information units.

One of the tabulating machine companies eventually transformed itself into a 
major computer company, the International Business Machines Corporation. Punched 
cards are now obsolete, but non-numeric applications of computers based on the repre- 
sentation of character data seemingly continue to grow in importance. 

Since about 1980, all computers have standardized on the 8-bit byte as the smallest
practicable information unit typically supported by the architecture through the
instruction set. Larger supported sizes of information units are almost invariably multiples
of the byte.

Any reduced instruction set computer (RISC) design represents compromises 
adopted for the sake of achieving performance goals without seriously eroding functional- 
ity for the assembly language programmer or compiler writer. The designers of the Alpha 
retained byte-addressability of information units in memory, but did not provide any 
direct byte-level access to data. Instead, they limited the load/store instructions to long- 
word and quadword information units. Nevertheless, the Alpha instruction set contains a 
group of instructions specifically tailored to working with one or a few bytes of data in 
register-to-register operations that do not suffer from the timing penalties of memory 
accesses. We turn to those instructions and some of their applications in this chapter. 

Later, the Alpha architecture was extended to include 8- and 16-bit load and store 
instructions, which are discussed in Chapter 13. 


Extract Byte Instructions 


At the end of Chapter 5, we demonstrated one way to isolate the bytes from a larger 
information unit and how to pack computed byte data into a larger information unit, 
using shift instructions. Some of the byte-manipulation instructions provided in the 
Alpha instruction set also shift data, but the magnitude of the left/right shift ranges from 
0 to 7 bytes rather than 0 to 63 bits. Indeed, most of the instructions to be introduced in 
this chapter share opcode 12 with the shift instructions. 

Typically the byte-manipulating instructions combine shifting, zero-filling, and 
selective copying of data from source register Ra to destination register Rc in very spe- 
cific ways governed by the opcode, the function code, and the value in register Rb or a 
literal. Actually, only the lowest 3 bits of the value in register Rb or the literal are con- 
sidered. If we call this 3-bit shift field s, the range can specify a magnitude of shift from 
0 to 7 bytes. 

We begin with the group of extract byte instructions (ext) and their relationship 
to loading bytes of data. Seven different mnemonic opcodes are distinguished by vari- 
ous values in the function code field for these instructions in the integer operate class, as 
given in Table 6.1. The assembler syntax is:


        exttl   Ra,Rb,Rc                ; t = b, w, l, q
        exttl   Ra,lit,Rc               ; t = b, w, l, q
        extth   Ra,Rb,Rc                ; t = w, l, q
        extth   Ra,lit,Rc               ; t = w, l, q


where the lowest three bits found in register Rb or the literal determine the number
of bytes by which the source value in register Ra is shifted as it is copied into destination
register Rc. For opcodes ending in l (low), the shift is to the right by s byte positions
with fill on the left of an all-zero byte for each byte shifted off the right end. For
opcodes ending in h (high), the shift is to the left by the complementary amount, 8 - s
byte positions rather than by s bytes, with fill on the right with all-zero bytes.


Table 6.1 Alpha Extract Byte Instructions 


Mnemonic    Opcode    Function Code    Purpose
extbl       12        06               Extract byte low
extwl       12        16               Extract word low
extll       12        26               Extract longword low
extql       12        36               Extract quadword low
extwh       12        5A               Extract word high
extlh       12        6A               Extract longword high
extqh       12        7A               Extract quadword high


For example, if s = 5, and the bytes in register Ra are labeled a to h (right-to-left)
as shown in the top line in Figure 6.1 (designated :Ra), then the shifted value placed in
register Rc for an extract opcode ending in l would be 00000hgf, where the five
zeros at the left represent all-zero bytes. If the same value of s were used but the extract
opcode ended in h, the result would be edcba000, since the shift amount would be 8 - 5,
or 3. The actual resulting value in the destination register Rc may have additional
bytes zeroed, as we now explain.

The second aspect of the extract (ext) instruction is what size of information unit
to extract and retain in the destination register. The character t in the opcode indicates 
what portion of the shifted and zero-filled data (e.g., one byte, one word, one longword, 
or one quadword) should be copied into the destination register Rc. Additional zero fill- 
ing may occur at the left side of the destination register. This will be 7 bytes of zero for 
t = b, 6 bytes of zero for t = w, and 4 bytes of zero for t = l. In effect, the character t
in the instruction mnemonic represents a mask that is a byte, a word, a longword, or a 
quadword in width that is anchored to the right side of the register. For a shift parameter 
s = 0, a source value Ra = hgfedcba would yield 0000000a for a byte extract, 
000000ba for a word extract, 0000dcba for a longword extract, and hgfedcba for 
a quadword extract. 

When the shift property is combined with the masking to a particular size of infor- 
mation unit, the ext instruction offers 56 different combinations stemming from the 
seven instruction types (e.g., ways to copy portions of the shifted data) and the eight 
values for the magnitude s of the shift. We can schematically diagram one set of 7 cases 
for the value 5 for the shift parameter, as shown in Figure 6.1. Note that h (high) and l
(low) are from the perspective of the source register Ra, not the destination register Rc.
Note also that the information unit size t is from the perspective of the destination regis- 
ter Rc, not the source register Ra. 
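
The combination of shifting and masking can be explored with a short Python model. This is only a sketch. The names ext_low, ext_high, and show are our own inventions, and the left shift for the h forms is taken modulo 8 bytes so that s = 0 produces no shift at all.

```python
MASK64 = (1 << 64) - 1
SIZE_BYTES = {"b": 1, "w": 2, "l": 4, "q": 8}

def ext_low(ra, s, t):
    """Model of exttl: shift right by s bytes, keep the low unit of size t."""
    s &= 7                                       # only the lowest 3 bits count
    shifted = (ra & MASK64) >> (8 * s)
    return shifted & ((1 << (8 * SIZE_BYTES[t])) - 1)

def ext_high(ra, s, t):
    """Model of extth: shift left by 8 - s bytes, keep the low unit of size t."""
    if t == "b":
        raise ValueError("there is no high form for bytes")
    s &= 7
    shift_bits = (64 - 8 * s) % 64               # s = 0 means no shift
    shifted = ((ra & MASK64) << shift_bits) & MASK64
    return shifted & ((1 << (8 * SIZE_BYTES[t])) - 1)

def show(value):
    """Render a quadword as 8 characters, with '0' for each all-zero byte."""
    return "".join(chr(b) if b else "0" for b in value.to_bytes(8, "big"))

ra = int.from_bytes(b"hgfedcba", "big")          # byte a is the lowest byte
print(show(ext_low(ra, 5, "q")))                 # 00000hgf
print(show(ext_high(ra, 5, "q")))                # edcba000
print(show(ext_low(ra, 0, "w")))                 # 000000ba
```

Changing the values of s and t in the calls is a quick way to tabulate all 56 combinations.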

Following an ldq instruction, a single exttl instruction can perform the equivalent
of any load of a smaller contained information unit (bytes, words, longwords),
transferring the proper segment from the source quadword into the low end of a destina- 
tion register filled with zeros at its high end. Thus if the programmer or compiler writer 
is willing to manage the addressing offsets for the shift parameter, the Alpha can access 
smaller information units packed into aligned quadwords very efficiently. 


Loading Unaligned Data 


When we introduced load and store instructions in Chapter 4, we explained that RISC 
systems like the Alpha perform optimally if all data elements can be stored in memory 
using natural alignment boundaries. Despite this performance-oriented preference for 
alignment, some applications or system routines may have to be written to anticipate 
handling the worst case of unaligned data. The Alpha instructions for loading or storing 
unaligned quadwords mask away the three lowest-order address bits and actually send 
an aligned address to the memory subsystem. Accordingly, a single load or store opera- 
tion may not access the full information unit being sought. Nevertheless, relatively short 
sequences of byte manipulation instructions in combination with two load or store 
instructions can handle most of the general cases.

Suppose that we are using register R11 as the base register for the byte address
and that the quadword datum we seek is at offset Z from the address in register R11,
and unaligned, with its 8 bytes symbolically given as HGFE DCBA. The load unaligned
instruction ldq_u computes the effective address (EA) by adding the sign-extended
offset to the address value in the register and then masking out the three least significant 
bits. This changes the “any byte” EA into a quadword aligned value. For example, if we 
want to load the quadword at offset Z = 5, where register R11 holds the aligned address 
1000, the calculation for EA becomes: 


EA = (1000 + 5) AND NOT 7 


which evaluates to 1000. What gets loaded then is the quadword at the next lower 
aligned location Y that spans part of the desired unaligned quadword. In order to obtain 
the remaining part of the unaligned quadword, we need to specify another load 
unaligned operation using Z+7(R11) as the address specifier. Now, in general, the 
results from these two load operations will be two successive aligned quadwords with 
schematic contents as follows: 


        yyyH GFED :Y+8          CBAx xxxx :Y


Byte extract (ext) instructions can be used to collect the desired bytes from these two 
quadwords with appropriate shifts. A logical OR operation (using the bis instruction) 
will then fit the odd lots together. Such instruction sequences seem like “black magic” 
but can be understood with some careful study. 

In Alpha architecture handbooks, the vendor has published certain “intended 
sequences” of instructions for accessing unaligned information units. Those recommen- 
dations are likely to result in the best possible performance with present and future 
Alpha CPU chip implementations. 

We will now give those recommended instruction sequences for loading informa- 
tion units of various lengths with filling at the left using either zeros or sign-bit exten- 
sion. We will put comments on the instruction sequences that show the particular case 
where (EA modulo 8) = 5. This value 5 will correspond to the lowest 3 bits in register 
R3 after an lda instruction in each of the sequences below.


Loading an Unaligned Quadword 


        ldq_u   R1,Z(R11)               ; R1 = CBAx xxxx
        ldq_u   R2,Z+7(R11)             ; R2 = yyyH GFED
        lda     R3,Z(R11)               ; R3<2:0> = [(R11)+Z] modulo 8 = 5
        extql   R1,R3,R1                ; R1 = 0000 0CBA
        extqh   R2,R3,R2                ; R2 = HGFE D000
        bis     R2,R1,R1                ; R1 = HGFE DCBA <- (R2 OR R1)

Each ldq_u instruction converts a given byte address into an aligned quadword
address and then loads the datum from that aligned location. This makes the effective
address given as Z(R11) in the first instruction become some aligned address Y as we
showed above. Similarly, an unaligned load with an effective address given as
Z+7(R11) will actually obtain the datum from the aligned address Y+8, in general.
The lda instruction loads the actual byte address given by the signed offset Z plus the
contents of register R11.

Since the extql instruction uses a shift amount between 0 and 7 (bytes), it only 
looks at the lowest three bits of register R3. What we get is then: 


[(R11) + Z] modulo 8 =5 


because we have assumed (R11) = 1000 and Z = 5. Thus the extql shifts the bytes in
register R1 to the right 5 positions with zero filling at the left as explained in the previous
section. The complementary extqh instruction shifts the bytes in register R2 to the
left 3 positions. Finally, the bis instruction glues those two pieces together in register
R1. Review the logical OR operation if you are unsure about this last step. 

Let us verify that the above sequence also functions correctly
in the special case (EA modulo 8) = 0. Because we chose Z+7(R11)
instead of Z+8(R11), the two ldq_u instructions will put the same aligned quadword
data into registers R1 and R2 from aligned address Y. Quadword alignment implies that
bits <2:0> are zero in the effective address put into register R3 by the lda instruction.
There will be no shifting of bytes in the data in either R1 or R2, and the extql and extqh
instructions will degenerate to nops. Similarly, the bis instruction will just OR two copies of
the same datum. The optimized single ldq R1,Z(R11) instruction could be substituted
for efficiency only if (EA modulo 8) is actually known to be zero at the time that
the program is written.
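
Readers who wish to trace the quadword sequence without an Alpha system can use the following Python model. It is a sketch under our own assumptions: memory is a little-endian bytearray, the helper names merely echo the instruction mnemonics, and the test data place the letters A through H at byte addresses 1005 through 1012.

```python
MASK64 = (1 << 64) - 1

def ldq_u(mem, addr):
    base = addr & ~7                              # mask away the three low address bits
    return int.from_bytes(mem[base:base + 8], "little")

def extql(ra, rb):
    return (ra >> (8 * (rb & 7))) & MASK64        # shift right by s bytes

def extqh(ra, rb):
    return (ra << ((64 - 8 * (rb & 7)) % 64)) & MASK64   # shift left by 8 - s bytes

def load_unaligned_quad(mem, r11, z):
    r1 = ldq_u(mem, r11 + z)                      # R1 = CBAx xxxx
    r2 = ldq_u(mem, r11 + z + 7)                  # R2 = yyyH GFED
    r3 = r11 + z                                  # lda: R3<2:0> = EA modulo 8
    r1 = extql(r1, r3)                            # R1 = 0000 0CBA
    r2 = extqh(r2, r3)                            # R2 = HGFE D000
    return r2 | r1                                # bis: R1 = HGFE DCBA

mem = bytearray(b"x" * 2048)
mem[1005:1013] = b"ABCDEFGH"                      # A at the lowest byte address
value = load_unaligned_quad(mem, 1000, 5)
print(value.to_bytes(8, "big"))                   # b'HGFEDCBA'
```

Calling load_unaligned_quad(mem, 1000, 0) exercises the aligned special case, in which both loads fetch the same quadword.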


Loading and Zero-Extending an Unaligned Longword 


        ldq_u   R1,Z(R11)               ; R1 = CBAx xxxx
        ldq_u   R2,Z+3(R11)             ; R2 = yyyy yyyD
        lda     R3,Z(R11)               ; R3<2:0> = [(R11)+Z] modulo 8 = 5
        extll   R1,R3,R1                ; R1 = 0000 0CBA
        extlh   R2,R3,R2                ; R2 = 0000 D000
        bis     R2,R1,R1                ; R1 = 0000 DCBA <- (R2 OR R1)


Obtaining an unaligned longword is thus very similar to obtaining an unaligned quad- 
word. In this case and the next, where a longword information unit is sought, the offset 
for the second load should be expressed as Z+3, not Z+7, since any address beyond 
Z+3 cannot contain any bytes in common with a longword at Z.


Loading and Sign-Extending an Unaligned Longword


        ldq_u   R1,Z(R11)               ; R1 = CBAx xxxx
        ldq_u   R2,Z+3(R11)             ; R2 = yyyy yyyD
        lda     R3,Z(R11)               ; R3<2:0> = [(R11)+Z] modulo 8 = 5
        extll   R1,R3,R1                ; R1 = 0000 0CBA
        extlh   R2,R3,R2                ; R2 = 0000 D000
        bis     R2,R1,R1                ; R1 = 0000 DCBA <- (R2 OR R1)
        addl    R1,R31,R1               ; R1 = ssss DCBA


This case not only shows how short sequences of instructions can fit together like build- 
ing blocks, but also how the natural results of instructions can have surprising uses. The 
basic load sequence is identical to the previous unsigned case. 

The addl instruction, at first glance, seems stupid. Why add zero to the quantity
in register R1? The addl instruction always produces a sign-extended sum, which is
just what we want in this instance. The signed quantity in register R1 will then behave
appropriately in any subsequent longword or quadword instructions. Remember that the
aligned ldl instruction likewise performs sign extension in the destination register.
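
Both longword sequences can be modeled together in Python, since they differ only in the final addl step. As before, this is a sketch with hypothetical helper names and a little-endian bytearray standing in for memory; results are returned as 64-bit patterns.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def ldq_u(mem, addr):
    base = addr & ~7
    return int.from_bytes(mem[base:base + 8], "little")

def extll(ra, rb):
    return (ra >> (8 * (rb & 7))) & MASK32

def extlh(ra, rb):
    return ((ra << ((64 - 8 * (rb & 7)) % 64)) & MASK64) & MASK32

def addl(ra, rb):
    """Longword add: the 32-bit sum is sign-extended to 64 bits."""
    total = (ra + rb) & MASK32
    if total >> 31:
        total |= MASK64 ^ MASK32          # replicate the sign bit upward
    return total

def load_unaligned_long(mem, r11, z, signed=False):
    r1 = ldq_u(mem, r11 + z)
    r2 = ldq_u(mem, r11 + z + 3)
    r3 = r11 + z
    r1 = extlh(r2, r3) | extll(r1, r3)    # extll, extlh, bis
    if signed:
        r1 = addl(r1, 0)                  # addl R1,R31,R1
    return r1

mem = bytearray(b"\x55" * 2048)
mem[1005:1009] = (-2).to_bytes(4, "little", signed=True)
print(f"{load_unaligned_long(mem, 1000, 5):016X}")               # 00000000FFFFFFFE
print(f"{load_unaligned_long(mem, 1000, 5, signed=True):016X}")  # FFFFFFFFFFFFFFFE
```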


Loading and Zero-Extending an Unaligned Word 


        ldq_u   R1,Z(R11)               ; R1 = yBAx xxxx
        ldq_u   R2,Z+1(R11)             ; R2 = yBAx xxxx
        lda     R3,Z(R11)               ; R3<2:0> = [(R11)+Z] modulo 8 = 5
        extwl   R1,R3,R1                ; R1 = 0000 00BA
        extwh   R2,R3,R2                ; R2 = 0000 0000
        bis     R2,R1,R1                ; R1 = 0000 00BA <- (R2 OR R1)


Note the similarity to loading unaligned quadwords or longwords. In this case and the 
next, where a word-length information unit is sought, the offset for the second load 
should be expressed as Z+1, since any address beyond Z+1 cannot contain any bytes in 
common with a word at Z. 


Loading and Sign-Extending an Unaligned Word 


    ldq_u   R1,Z(R11)       ; R1 = yBAx xxxx
    ldq_u   R2,Z+1(R11)     ; R2 = yBAx xxxx
    lda     R3,Z+2(R11)     ; R3<2:0> = 5 + 2 = 7
    extql   R1,R3,R1        ; R1 = 0000 000y
    extqh   R2,R3,R2        ; R2 = BAxx xxx0
    bis     R2,R1,R1        ; R1 = BAxx xxxy <- (R2 OR R1)
    sra     R1,48,R1        ; R1 = ssss ssBA

This recommended procedure uses an augmented value of the shift parameter (still
modulo 8) and quadword ext instructions. Since the desired data are moved to the high
end of register R2 and then merged into register R1 by the bis instruction, a simple
arithmetic shift leads to the desired outcome. The signed result will then behave
appropriately in any subsequent longword or quadword instructions. The Alpha does not
have arithmetic or logical operations specific to word data.
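
To see that the augmented shift parameter works for every byte offset, the sequence
can be imitated in C on a small byte array. This is a sketch under our own
assumptions: memory is a little-endian byte array whose index 0 is quadword-aligned,
and the helper names merely echo the Alpha mnemonics.

```c
#include <assert.h>
#include <stdint.h>

/* ldq_u: fetch the aligned quadword that contains byte address ea. */
static uint64_t ldq_u(const uint8_t *mem, uint64_t ea)
{
    const uint8_t *p = mem + (ea & ~(uint64_t)7);
    uint64_t q = 0;
    for (int i = 7; i >= 0; i--)
        q = (q << 8) | p[i];
    return q;
}

/* extql / extqh with shift parameter s = rb<2:0>. */
static uint64_t extql(uint64_t ra, uint64_t rb) { return ra >> (8 * (rb & 7)); }
static uint64_t extqh(uint64_t ra, uint64_t rb) { return ra << ((64 - 8 * (rb & 7)) & 63); }

/* sra: arithmetic right shift by n bits, written without relying on
   implementation-defined signed shifts. */
static uint64_t sra(uint64_t ra, unsigned n)
{
    assert(n > 0 && n < 64);
    return (ra >> n) | ((0 - (ra >> 63)) << (64 - n));
}

/* The seven-instruction sequence for a sign-extended unaligned word. */
static uint64_t load_word_signed(const uint8_t *mem, uint64_t ea)
{
    uint64_t r1 = ldq_u(mem, ea);       /* ldq_u R1,Z(R11)   */
    uint64_t r2 = ldq_u(mem, ea + 1);   /* ldq_u R2,Z+1(R11) */
    uint64_t r3 = ea + 2;               /* lda   R3,Z+2(R11) */
    r1 = extql(r1, r3);
    r2 = extqh(r2, r3);
    r1 = r2 | r1;                       /* bis */
    return sra(r1, 48);
}
```

A word that straddles two quadwords (offset 7) and one that lies inside a single
quadword (offset 5) are both handled by the same code path.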


Loading and Zero-Extending a Byte 


    ldq_u   R1,Z(R11)       ; R1 = yyAx xxxx
    lda     R3,Z(R11)       ; R3<2:0> = [(R11)+Z] modulo 8 = 5
    extbl   R1,R3,R1        ; R1 = 0000 000A


In this case and the next, where a byte-length information unit is sought, only one 
unaligned load instruction is required. For this simple (unsigned) load, the total of three 
instructions is about what one expects in going from a CISC architectural model to a 
RISC model, namely two to three RISC instructions needed to do something equivalent 
to one CISC instruction. 


Loading and Sign-Extending a Byte 


    ldq_u   R1,Z(R11)       ; R1 = yyAx xxxx
    lda     R3,Z+1(R11)     ; R3<2:0> = [(R11)+Z+1] modulo 8
                            ; (a special trick here)
    extqh   R1,R3,R1        ; R1 = Axxx xxxx
    sra     R1,56,R1        ; R1 = ssss sssA


In this instance, modifying the displacement in the lda instruction at assembly time from
the expected Z to Z+1 eliminates any need for an sll instruction. The extqh instruction
positions the sign bit of the desired byte into bit 63 of register R1. Then the sra
instruction removes the 7 extraneous bytes from the right end and replicates the sign
bit. That is, the right and left shifts are complementary. The signed quantity in
register R1 will then behave appropriately in any subsequent longword or quadword
instructions. The Alpha does not have arithmetic or logical operations specific to byte
data.
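
The Z+1 trick can be checked with a few lines of C. This is a sketch only: the
quadword already fetched by ldq_u is passed in as an integer, and the names are ours,
chosen to resemble the Alpha mnemonics.

```c
#include <assert.h>
#include <stdint.h>

/* extqh with shift parameter s = rb<2:0>: shift left by 64 - 8s bits,
   where the shift count is taken modulo 64. */
static uint64_t extqh(uint64_t ra, uint64_t rb)
{
    return ra << ((64 - 8 * (rb & 7)) & 63);
}

/* Arithmetic right shift by n bits without implementation-defined behavior. */
static uint64_t sra(uint64_t ra, unsigned n)
{
    assert(n > 0 && n < 64);
    return (ra >> n) | ((0 - (ra >> 63)) << (64 - n));
}

/* quad is the aligned quadword holding the byte at address ea. */
static uint64_t load_byte_signed(uint64_t quad, uint64_t ea)
{
    uint64_t r3 = ea + 1;             /* lda R3,Z+1(R11): the special trick */
    uint64_t r1 = extqh(quad, r3);    /* desired byte now occupies bits 63:56 */
    return sra(r1, 56);               /* discard 7 bytes, replicate the sign */
}
```

For byte offset 7 the augmented parameter wraps to zero, the left shift is zero, and
the byte is already in the high position, so the same two instructions still apply.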


SCANTEXT: An Example of Byte Input 


Computers can do much more than just calculate mathematical functions. Many important
applications involve analysis of data encoded as ASCII characters, stored one per
byte as strings or records. Suppose that we wish to engage in some very rudimentary
textual analysis, perhaps just to count the number of words in a line of text, considering
words to be separated by spaces and the end of the line as a zero or null byte.

The program SCANTEXT in Figure 6.2 will accomplish this task, as well as add
up the total number of characters within the words, including any adjacent punctuation
but not the spaces.


/* SCANTEXT     Text Analyzer     (Unix) */

/* This program will count the number of characters
   (including punctuation but not spaces) in a sentence
   and find how many words it contained. */

SPACE   =       0x20                    # ASCII code for <SP>
CHARS   =       0                       # Offset to char count
WORDS   =       8                       # Offset to word count
        .data
        .comm   QUADS,2*8               # Space for results
TEXT:   .asciiz "The faster I run the behinder I get."
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    # mark the mandatory
main:                                   # 'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        .frame  $sp,0,$26,0             # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
first:  bis     $31,$31,$2              # R2 counts characters
        bis     $31,$31,$0              # R0 counts words
        lda     $3,TEXT-1               # Pre-index
next:   lda     $3,1($3)                # R3 -> next character
        ldq_u   $1,($3)                 # Quadword containing char
        extbl   $1,$3,$1                # R1 = character code
        beq     $1,nomore               # End of string?
        subq    $1,SPACE,$1             # Destructive testing
        beq     $1,word                 # End of word
        addq    $2,1,$2                 # Count one character
        br      $31,next                # Look for more chars
word:   addq    $0,1,$0                 # Count one word
        br      $31,next                # Look for more chars
nomore: addq    $0,1,$0                 # The last word
done:   lda     $3,QUADS                # R3 -> numerical storage
        stq     $2,CHARS($3)            # Number of characters
        stq     $0,WORDS($3)            # Number of words
        mov     0,$0                    # Signal all is normal
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 6.2 SCANTEXT: An illustration of input of unaligned bytes

This illustrative program expects to find a null byte to mark the end of the input
text (the .asciiz directive adds this byte). The address pointer is pre-indexed in order
to compensate for the incrementation at label next on entrance to the main loop. For
each character, a three-way case structure (two conditional branches) determines
whether it is a null (end of line), a space (end of word), or any other character. Multiple
adjacent spaces would lead to an overestimate of the number of words. The sample
sentence in SCANTEXT contains 8 words and 29 characters (including the period).
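
For comparison, the same three-way case structure can be written in C. This is a
sketch of the counting logic only, not a translation of the register usage; the
struct and function names are our own.

```c
#include <assert.h>
#include <stddef.h>

struct scan_counts {
    long chars;   /* characters inside words (R2 in SCANTEXT) */
    long words;   /* words (R0 in SCANTEXT) */
};

/* Scan a null-terminated line, treating each space as the end of a word. */
static struct scan_counts scan_text(const char *text)
{
    struct scan_counts n = {0, 0};
    assert(text != NULL);
    for (const char *p = text; ; p++) {
        if (*p == '\0') {         /* nomore: null byte ends the line */
            n.words++;            /* count the last word */
            break;
        } else if (*p == ' ') {   /* word: a space ends a word */
            n.words++;
        } else {
            n.chars++;            /* any other character */
        }
    }
    return n;
}
```

As in the assembly version, every space counts as a word boundary, so the string
"a  b" with two adjacent spaces is reported as three words.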


Had we not chosen pre-indexing of register R3 as part of the initialization phase of 
this program, but instead had pointed exactly to TEXT, we would then have had to 
include two separate instances of the incrementation of the pointer prior to each back- 
ward branch rather than the single instance at next. In general, it is advisable to avoid 
distributed replicates of code elements, primarily to simplify the logical flow and pro- 
mote maintainability, not merely to economize on memory or even to enhance execu- 
tion speed. 

Somewhat arbitrarily we chose to use destructive testing on the throw-away copy
of a character code in register R1 using the subq instruction. Such destructive tests are
common practice, but the Alpha instruction set offers the alternative of a cmpeq
instruction. What particular branch instruction would then be needed right after a
cmpeq instruction? This choice between subq (destructive) and cmpeq (non-destructive)
shows that even RISC instruction sets provide considerable latitude for expressing
algorithms.


Insert Byte Instructions 


The byte extract (ext) instructions have two sets of inverse operations, the byte insert 
(ins) and byte mask (msk) instructions in the Alpha architecture. As we shall see, 
these instructions produce complementary effects. And, as we shall also see, these two 
instruction types are both needed to support storing into information units at arbitrary 
byte addresses, given the other intrinsic limitations of the Alpha architecture. 


The group of insert byte instructions (ins) can selectively position new data in
registers prior to storage in memory. Seven different mnemonic opcodes are distinguished
by various values in the function code field for these instructions in the integer
operate class, as given in Table 6.2. The assembler syntax is:


    instl   Ra,Rb,Rc        ; t = b, w, l, q
    instl   Ra,lit,Rc       ; t = b, w, l, q
    insth   Ra,Rb,Rc        ; t = w, l, q
    insth   Ra,lit,Rc       ; t = w, l, q


where the character t in the opcode indicates what portion of the source value in
register Ra will be shifted as it is copied into destination register Rc. Only 1, 2, 4,
or 8 bytes from the right side of the source value are used for t = b, w, l, or q,
respectively; at the left side, 7, 6, 4, or 0 bytes are zeroed.


Table 6.2 Alpha Insert Byte Instructions


Mnemonic    Opcode    Function Code    Purpose
insbl       12        0B               Insert byte low
inswl       12        1B               Insert word low
insll       12        2B               Insert longword low
insql       12        3B               Insert quadword low
inswh       12        57               Insert word high
inslh       12        67               Insert longword high
insqh       12        77               Insert quadword high





The lowest three bits found in register Rb or the literal determine the shift amount
in bytes that the masked byte, word, longword, or quadword is shifted. For opcodes
ending in l (low), the shift is to the left by s byte positions with fill on the right of an
all-zero byte for each byte shifted off the left end. For opcodes ending in h (high), the
shift is to the right by the complementary amount, 8 - s byte positions rather than by s
bytes, with fill on the left with all-zero bytes.

For example, if s = 5, and the bytes in register Ra are labeled a to h (right-to-left)
as shown in the top line in Figure 6.3 (designated :Ra), then the shifted value placed in
register Rc for an insert opcode ending in l would be cba00000, where the five zeros
at the right represent zero bytes. If the same value of s were used but the insert opcode
ended in h, the result would be 000hgfed, since the shift amount would be 8 - 5, or 3.
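
The two shift directions can be modeled in C as a check on the example. The sketch
below is illustrative only: ins_low and ins_high are our own names, the size is
passed as a byte count (1, 2, 4, or 8 for t = b, w, l, q), and the bytes a through h
are represented by the values 01 through 08.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the rightmost size_bytes bytes of a quadword. */
static uint64_t size_mask(unsigned size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4 || size_bytes == 8);
    return size_bytes == 8 ? ~(uint64_t)0
                           : (((uint64_t)1 << (8 * size_bytes)) - 1);
}

/* instl: keep the low size_bytes bytes of ra, shift left by s bytes. */
static uint64_t ins_low(uint64_t ra, unsigned s, unsigned size_bytes)
{
    return (ra & size_mask(size_bytes)) << (8 * (s & 7));
}

/* insth: the bytes that instl pushed off the left end, right-justified.
   For s = 0 nothing is pushed off, so the result is zero. */
static uint64_t ins_high(uint64_t ra, unsigned s, unsigned size_bytes)
{
    s &= 7;
    if (s == 0)
        return 0;
    return (ra & size_mask(size_bytes)) >> (64 - 8 * s);
}
```

With Ra = 0x0807060504030201 and s = 5, ins_low with size 8 gives cba0 0000 and
ins_high with size 8 gives 000h gfed, in agreement with the text.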





    h g f e  d c b a    :Ra

    0 0 a 0  0 0 0 0    insbl
    0 b a 0  0 0 0 0    inswl
    c b a 0  0 0 0 0    insll
    c b a 0  0 0 0 0    insql
    0 0 0 0  0 0 0 0    inswh
    0 0 0 0  0 0 0 d    inslh
    0 0 0 h  g f e d    insqh


Figure 6.3 Insert byte instructions with shift parameter s = 5


When the shift property is combined with the masking to a particular size of
information unit, the ins instruction offers 56 different combinations stemming from the
seven instruction types (e.g., ways to mask portions of the source data) and the eight
values for the magnitude s of the shift. We can schematically diagram one set of 7 cases
for the value 5 for the shift parameter, as shown in Figure 6.3. Note that h (high) and l
(low) are from the perspective of the source register Ra, not the destination register Rc.
Note also that the masking effect for the information unit t is also from the perspective
of the source register Ra.


Mask Byte Instructions 


Each member of the group of mask byte instructions (msk) is completely complemen- 
tary to the corresponding ins instruction: at byte positions where ins has zero fill, 
msk copies data bytes, and vice versa. Seven different mnemonic opcodes are distin- 
guished by various values in the function code field for these instructions in the integer 
operate class, as given in Table 6.3. The assembler syntax is: 


    msktl   Ra,Rb,Rc        ; t = b, w, l, q
    msktl   Ra,lit,Rc       ; t = b, w, l, q
    mskth   Ra,Rb,Rc        ; t = w, l, q
    mskth   Ra,lit,Rc       ; t = w, l, q


where the character t in the opcode indicates the size of a mask (e.g., one byte, one
word, one longword, or one quadword) which should be shifted and then used to determine
the byte positions where data bytes from the source register Ra will be replaced
with zero bytes before being copied into destination register Rc.


Table 6.3 Alpha Mask Byte Instructions


Mnemonic    Opcode    Function Code    Purpose
mskbl       12        02               Mask byte low
mskwl       12        12               Mask word low
mskll       12        22               Mask longword low
mskql       12        32               Mask quadword low
mskwh       12        52               Mask word high
msklh       12        62               Mask longword high
mskqh       12        72               Mask quadword high


The lowest three bits found in register Rb or the literal determine the shift amount
in bytes that the byte-, word-, longword-, or quadword-width mask is to be shifted. For
opcodes ending in l (low), the shift is to the left by s byte positions. For opcodes ending
in h (high), the mask has an effective width of t - (8 - s) bytes, or zero if that quantity
would be negative, and extends from the right end of the destination register Rc towards
the left.

For example, if s = 5, and the bytes in register Ra are labeled a to h (right-to-left)
as shown in the top line of Figure 6.4 (designated :Ra), then the mask begins at byte
position 5 and extends to the left for t = 1, 2, 4, or 8 bytes (or to the left edge of
destination register Rc) for a mask opcode ending in l. If the same value of s were used
but the mask opcode ended in h, the mask would have a width of MAX(t - 3, 0), and would
take effect at the right end of destination register Rc.
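
A C model of the mask instructions makes the complementary relationship to the
insert instructions easy to test. As before this is only a sketch: msk_low and
msk_high are hypothetical names, the size is a byte count (1, 2, 4, or 8), and bytes
a through h are represented by 01 through 08.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the rightmost size_bytes bytes of a quadword. */
static uint64_t size_mask(unsigned size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4 || size_bytes == 8);
    return size_bytes == 8 ? ~(uint64_t)0
                           : (((uint64_t)1 << (8 * size_bytes)) - 1);
}

/* msktl: zero size_bytes bytes of ra starting at byte position s. */
static uint64_t msk_low(uint64_t ra, unsigned s, unsigned size_bytes)
{
    return ra & ~(size_mask(size_bytes) << (8 * (s & 7)));
}

/* mskth: zero MAX(size_bytes - (8 - s), 0) bytes at the right end of ra.
   For s = 0 the mask has zero width, so ra passes through unchanged. */
static uint64_t msk_high(uint64_t ra, unsigned s, unsigned size_bytes)
{
    s &= 7;
    if (s == 0)
        return ra;
    return ra & ~(size_mask(size_bytes) >> (64 - 8 * s));
}
```

With s = 5, msk_low for a quadword keeps only the five rightmost bytes and msk_high
for a quadword keeps only the three leftmost bytes, the positions that the
corresponding insert instructions fill with zeros.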


    h g f e  d c b a    :Ra

    h g 0 e  d c b a    mskbl
    h 0 0 e  d c b a    mskwl
    0 0 0 e  d c b a    mskll
    0 0 0 e  d c b a    mskql
    h g f e  d c b a    mskwh
    h g f e  d c b 0    msklh
    h g f 0  0 0 0 0    mskqh


Figure 6.4 Mask byte instructions with shift parameter s = 5


When the shift property is combined with the masking to a particular size of infor- 
mation unit, the msk instruction offers 56 different combinations stemming from the 
seven instruction types (e.g., ways to construct a mask) and the eight values for the 
magnitude s of the shift parameter. We can schematically diagram one set of 7 cases for 
the value 5 for the shift parameter, as shown in Figure 6.4. 


Storing Modified Unaligned Data 


The complementary insert and mask byte instructions can work together to edit specific 
portions of unaligned quadwords before using unaligned store instructions to transfer 
the data to the memory subsystem. 


Suppose that we are using register R11 as the base register for the byte address
and that we want to store a quadword datum at offset Z from the address in register
R11. We wish to store the eight bytes symbolically given as hgfe dcba that are currently
in some register. Further suppose that the original contents in memory at that
address are the eight bytes symbolically given as HGFE DCBA. The load and store
unaligned instructions, ldq_u and stq_u, compute the effective address (EA) by
adding the sign-extended offset to the address value in the register and then masking
out the three least significant bits. This changes the “any byte” EA into a quadword
aligned value.

For example, if we want to store the quadword at offset Z = 5, where register R11
holds the aligned address 1000, the calculation for EA becomes:


    EA = (1000 + 5) AND NOT 7


which evaluates to 1000. What gets stored then is the quadword at the next lower
aligned location Y, which holds only part of the desired unaligned quadword. In order to
store the remaining part of the unaligned quadword, we need to use another store
unaligned operation with Z+7(R11) as the address specifier.
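
The address arithmetic can be expressed in a few lines of C. This is a sketch with
names of our own choosing; it computes the two aligned quadword addresses touched by
an information unit of a given size, using the Z and Z+(size-1) displacements
described in this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* For a unit of size_bytes bytes at byte address base + disp, compute the
   aligned addresses that ldq_u/stq_u would use for its first and last byte. */
static void unaligned_span(uint64_t base, int64_t disp, unsigned size_bytes,
                           uint64_t *low_ea, uint64_t *high_ea)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4 || size_bytes == 8);
    uint64_t ea = base + (uint64_t)disp;
    *low_ea  = ea & ~(uint64_t)7;                     /* EA AND NOT 7 */
    *high_ea = (ea + size_bytes - 1) & ~(uint64_t)7;  /* same, for the last byte */
}
```

For base 1000 and displacement 5, a quadword yields the pair 1000 and 1008, while a
word yields 1000 twice because both of its bytes lie in the same aligned quadword.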

We cannot simply store at those addresses, however, because there will be some 
bytes containing information that should remain unaltered. Remember, we have one 
quadword to store in a manner split across two aligned quadword locations in memory. 
Therefore, the overall procedure for storing at an unaligned address must begin with 
loading from that same address first, by using two unaligned load instructions. In gen- 
eral, the results from these two load operations will be two successive aligned quad- 
words as follows: 


        yyyH GFED :Y+8          CBAx xxxx :Y


Insert instructions can be used to partition the new datum to be stored, hgfe dcba,
into two registers:


        000h gfed               cba0 0000


Mask instructions can be used to zero out the old bytes, HGFE DCBA, in two other
registers:


        yyy0 0000               000x xxxx


Two logical OR operations (using the bis instruction) can be used to fit the odd lots
together in two registers. Then the values in those registers can be stored at aligned
locations Y and Y+8:


        yyyh gfed :Y+8          cbax xxxx :Y


Such instruction sequences seem like “black magic” but can be understood with some 
careful study. Figure 6.5 shows in a schematic way how such sequences work. 


Figure 6.5 Modification of unaligned data using the byte manipulation instructions
(diagram labels: original Q, modified Q)


In Alpha architecture handbooks, the vendor has published certain “intended 
sequences” of instructions for accessing unaligned information units. Those recommen- 
dations are likely to result in the best possible performance with present and future 
Alpha CPU chip implementations. 


We will now give those recommended instruction sequences for storing information
units of various lengths. We will put comments on the instruction sequences that
show the particular case where (EA modulo 8) = 5. This value 5 will correspond to the
lowest 3 bits in register R6 after an lda instruction in each of the sequences below.


Register R5 contains the quadword value hgfe dcba which is to be stored in an
unaligned fashion as a quadword (hgfedcba), longword (dcba), word (ba), or byte (a).


Storing an Unaligned Quadword


    lda     R6,Z(R11)       ; R6<2:0> = [(R11)+Z] modulo 8 = 5
    ldq_u   R2,Z+7(R11)     ; R2 = yyyH GFED
    ldq_u   R1,Z(R11)       ; R1 = CBAx xxxx
    insqh   R5,R6,R4        ; R4 = 000h gfed
    insql   R5,R6,R3        ; R3 = cba0 0000
    mskqh   R2,R6,R2        ; R2 = yyy0 0000
    mskql   R1,R6,R1        ; R1 = 000x xxxx
    bis     R2,R4,R2        ; R2 = yyyh gfed
    bis     R1,R3,R1        ; R1 = cbax xxxx
    stq_u   R2,Z+7(R11)     ; Store high first for the
    stq_u   R1,Z(R11)       ; degenerate case of aligned quad


Let us verify that the behavior of the above sequence is also correct even in the special
case (EA modulo 8) = 0. Because we chose Z+7(R11) instead of Z+8(R11), the two
ldq_u instructions will put the same aligned quadword data into registers R1 and R2.
Quadword alignment implies that bits <2:0> are zero in the effective address put into
register R6 by the lda instruction. With a shift parameter of zero, the ins and msk
instructions will put all zeros into R1 and R4, leave R2 unchanged (the high mask has
zero width), and copy R5 into R3. The first stq_u therefore rewrites the old quadword
and the second stq_u then deposits the new data, which is why the high part must be
stored first. The optimized single stq R5,Z(R11) instruction could be substituted
for efficiency only if (EA modulo 8) is known to be zero at the time that the program
is written.
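
The whole store sequence, including the degenerate aligned case, can be exercised in
C against a byte array. This is a sketch under stated assumptions: memory is a
little-endian byte array whose index 0 is quadword-aligned, and the helper functions
are simplified quadword-only models named after the Alpha mnemonics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t ldq_u(const uint8_t *mem, uint64_t ea)
{
    const uint8_t *p = mem + (ea & ~(uint64_t)7);
    uint64_t q = 0;
    for (int i = 7; i >= 0; i--)
        q = (q << 8) | p[i];
    return q;
}

static void stq_u(uint8_t *mem, uint64_t ea, uint64_t q)
{
    uint8_t *p = mem + (ea & ~(uint64_t)7);
    for (int i = 0; i < 8; i++, q >>= 8)
        p[i] = (uint8_t)q;
}

static uint64_t insql(uint64_t ra, uint64_t rb) { return ra << (8 * (rb & 7)); }
static uint64_t insqh(uint64_t ra, uint64_t rb)
{
    unsigned s = (unsigned)(rb & 7);
    return s ? ra >> (64 - 8 * s) : 0;
}
static uint64_t mskql(uint64_t ra, uint64_t rb)
{
    return ra & ~(~(uint64_t)0 << (8 * (rb & 7)));
}
static uint64_t mskqh(uint64_t ra, uint64_t rb)
{
    unsigned s = (unsigned)(rb & 7);
    return s ? ra & ~(~(uint64_t)0 >> (64 - 8 * s)) : ra;
}

/* Store quadword r5 at arbitrary byte address ea, preserving neighbors. */
static void store_quad_unaligned(uint8_t *mem, uint64_t ea, uint64_t r5)
{
    assert(mem != NULL);
    uint64_t r6 = ea;                    /* lda   R6,Z(R11)   */
    uint64_t r2 = ldq_u(mem, ea + 7);    /* ldq_u R2,Z+7(R11) */
    uint64_t r1 = ldq_u(mem, ea);        /* ldq_u R1,Z(R11)   */
    uint64_t r4 = insqh(r5, r6);
    uint64_t r3 = insql(r5, r6);
    r2 = mskqh(r2, r6) | r4;             /* mskqh, then bis */
    r1 = mskql(r1, r6) | r3;             /* mskql, then bis */
    stq_u(mem, ea + 7, r2);              /* store high first ... */
    stq_u(mem, ea, r1);                  /* ... so the new data land last */
}
```

Calling the function with ea = 5 changes exactly bytes 5 through 12 of the array, and
calling it with an aligned address changes exactly the one quadword addressed.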


Storing an Unaligned Longword 


    lda     R6,Z(R11)       ; R6<2:0> = [(R11)+Z] modulo 8 = 5
    ldq_u   R2,Z+3(R11)     ; R2 = yyyy yyyD
    ldq_u   R1,Z(R11)       ; R1 = CBAx xxxx
    inslh   R5,R6,R4        ; R4 = 0000 000d
    insll   R5,R6,R3        ; R3 = cba0 0000
    msklh   R2,R6,R2        ; R2 = yyyy yyy0
    mskll   R1,R6,R1        ; R1 = 000x xxxx
    bis     R2,R4,R2        ; R2 = yyyy yyyd
    bis     R1,R3,R1        ; R1 = cbax xxxx
    stq_u   R2,Z+3(R11)     ; Store high first for the
    stq_u   R1,Z(R11)       ; degenerate case of aligned


In this case, where a longword information unit is accessed, the offset for the high-order 
load should be expressed as Z+3, not Z+7, since any address beyond Z+3 cannot con- 
tain any bytes in common with a longword at Z. Otherwise, the scheme is very similar 
for storing a modified longword as for a quadword. 


Storing an Unaligned Word


    lda     R6,Z(R11)       ; R6<2:0> = [(R11)+Z] modulo 8 = 5
    ldq_u   R2,Z+1(R11)     ; R2 = yBAx xxxx
    ldq_u   R1,Z(R11)       ; R1 = yBAx xxxx
    inswh   R5,R6,R4        ; R4 = 0000 0000
    inswl   R5,R6,R3        ; R3 = 0ba0 0000
    mskwh   R2,R6,R2        ; R2 = yBAx xxxx
    mskwl   R1,R6,R1        ; R1 = y00x xxxx
    bis     R2,R4,R2        ; R2 = yBAx xxxx
    bis     R1,R3,R1        ; R1 = ybax xxxx
    stq_u   R2,Z+1(R11)     ; Store high first for the
    stq_u   R1,Z(R11)       ; degenerate case of aligned


In this case, where a word-length information unit is accessed, the offset for the high- 
order load should be expressed as Z+1, since any address beyond Z+1 cannot contain 
any bytes in common with a word at Z. Otherwise, the scheme is very similar for stor- 
ing a modified word as for a quadword or longword. 


Storing a Byte 


    lda     R6,Z(R11)       ; R6<2:0> = [(R11)+Z] modulo 8 = 5
    ldq_u   R1,Z(R11)       ; R1 = yyAx xxxx
    insbl   R5,R6,R3        ; R3 = 00a0 0000
    mskbl   R1,R6,R1        ; R1 = yy0x xxxx
    bis     R1,R3,R1        ; R1 = yyax xxxx
    stq_u   R1,Z(R11)       ; Store


In this case, where a byte-length information unit is accessed, only one unaligned load 
instruction and, correspondingly, only one unaligned store instruction will be required. 


Technical Specifications of the Extract, Insert, and Mask 
Byte Instructions 


This section of our book is not “core” material. Readers who only skim lightly through 
this section will not be handicapped in comprehending material presented later. 

The instructions for byte manipulations are more complicated than most other
Alpha instructions. In any architecture, the precise operation of various machine
instructions can be defined in a more formal way using a pseudocode notation that
resembles a high-level language. For conceptual clarity, the pseudocode may introduce
what look to be intermediate quantities. Such “hidden variables” are seldom stored, but
rather just represent transitory states of the actual digital logic elements that implement
the architecture.

In the following pseudocode, a final character of v denotes “the value of” and the
notation <p:q> denotes a range of bits. Alpha implementations may include optional
big-endian byte addressing support. As we mentioned in Chapter 2, “endian” is a term
referring to the ordering of bytes contained in, say, a quadword information unit actually
holding an integer. In the discussion below, the left-to-right convention for bit numbering
within bytes adopted by Digital Equipment Corporation still applies to the
description of big-endian as well as little-endian operations.


Extract Byte Instructions 


CASE
    big_endian_data:    Rbv' := Rbv XOR ^b111
    little_endian_data: Rbv' := Rbv
ENDCASE

CASE
    extbl: byte_mask := ^b0000 0001
    extwx: byte_mask := ^b0000 0011
    extlx: byte_mask := ^b0000 1111
    extqx: byte_mask := ^b1111 1111
ENDCASE

CASE
    exttl:
        byte_loc := Rbv'<2:0>*8
        temp := RIGHT_SHIFT(Rav, byte_loc<5:0>)
        Rc := BYTE_ZAP(temp, NOT(byte_mask))
    extth:
        byte_loc := 64 - Rbv'<2:0>*8
        temp := LEFT_SHIFT(Rav, byte_loc<5:0>)
        Rc := BYTE_ZAP(temp, NOT(byte_mask))
ENDCASE


The sequences of instructions for loading data on an Alpha system with an operat- 
ing system that boots up the CPU in big-endian mode are somewhat different than those 
presented earlier in this chapter under the presumption that little-endian mode would be 
in effect. 

With the same assumptions that we used earlier, that is, for access from an effec- 
tive address modulo 8 that would result in a shift parameter of 5 for little-endian data, 
the sequence for loading and zero-extending a byte is as follows in a big-endian envi- 
ronment: 


    ldq_u   R1,Z(R11)       ; R1 = xxxx xAyy
    lda     R3,Z(R11)       ; R3<2:0> = 5, but shift = 2 bytes
    extbl   R1,R3,R1        ; R1 = 0000 000A


Note that the shift for the extract byte instruction is computed differently for big-endian 
than for little-endian (5 XOR ^b111 results in ^b010, or 2). 
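
The adjustment of the shift parameter is a one-line computation. The C sketch below
uses a function name and a flag argument of our own invention to stand for the
byte-ordering mode.

```c
#include <assert.h>
#include <stdint.h>

/* Effective byte shift used by the ext, ins, and msk instructions:
   Rbv<2:0> in little-endian mode, Rbv<2:0> XOR 7 in big-endian mode. */
static unsigned ext_shift_bytes(uint64_t rbv, int big_endian)
{
    uint64_t rbv_prime = big_endian ? (rbv ^ 7u) : rbv;
    unsigned shift = (unsigned)(rbv_prime & 7u);
    assert(shift < 8);   /* only three bits participate */
    return shift;
}
```

A value of 5 in the low three bits therefore gives a shift of 5 bytes for
little-endian data and 2 bytes for big-endian data.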

The sequence for loading a quadword from an unaligned address is considerably 
different for a big-endian environment: 


    ldq_u   R1,Z(R11)       ; R1 = xxxx xABC
    ldq_u   R2,Z+7(R11)     ; R2 = DEFG Hyyy
    lda     R3,Z+7(R11)     ; R3<2:0> = 4, but shift = 3 bytes
    extqh   R1,R3,R1        ; R1 = ABC0 0000
    extql   R2,R3,R2        ; R2 = 000D EFGH
    bis     R1,R2,R1        ; R1 = ABCD EFGH


The address displacement in the lda instruction should be Z+7 for quadwords, Z+3
for longwords, and Z+1 for words; for a little-endian environment, these are all just Z.
Furthermore, the extqh and extql instructions are reversed with respect to the
little-endian sequence.

The aligned load and store instructions also compute an effective virtual address
differently in the two byte-ordering modes. The big-endian virtual address is formed as
the XOR of ^b100 and the little-endian virtual address.


Insert Byte Instructions 


CASE
    big_endian_data:    Rbv' := Rbv XOR ^b111
    little_endian_data: Rbv' := Rbv
ENDCASE

CASE
    insbl: byte_mask := ^b0000 0000 0000 0001
    inswx: byte_mask := ^b0000 0000 0000 0011
    inslx: byte_mask := ^b0000 0000 0000 1111
    insqx: byte_mask := ^b0000 0000 1111 1111
ENDCASE

byte_mask := LEFT_SHIFT(byte_mask, Rbv'<2:0>)

CASE
    instl:
        byte_loc := Rbv'<2:0>*8
        temp := LEFT_SHIFT(Rav, byte_loc<5:0>)
        Rc := BYTE_ZAP(temp, NOT(byte_mask<7:0>))
    insth:
        byte_loc := 64 - Rbv'<2:0>*8
        temp := RIGHT_SHIFT(Rav, byte_loc<5:0>)
        Rc := BYTE_ZAP(temp, NOT(byte_mask<15:8>))
ENDCASE


Unlike the ext instruction, the ins instruction can be described best using a 
binary mask that is 16 bits wide, not 8 bits wide. Different, but interrelated, masks are 
needed for the high and low variants of the ins instruction. 
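
The role of the 16-bit mask is perhaps easiest to see by transcribing the pseudocode
into C for little-endian data. This is only a sketch: the function names are ours,
the base mask argument is the unshifted byte_mask value (0x01, 0x03, 0x0F, or 0xFF),
and byte_zap models BYTE_ZAP by clearing each byte whose mask bit is 1.

```c
#include <assert.h>
#include <stdint.h>

/* BYTE_ZAP: clear byte i of x wherever bit i of zap is 1. */
static uint64_t byte_zap(uint64_t x, unsigned zap)
{
    for (unsigned i = 0; i < 8; i++)
        if (zap & (1u << i))
            x &= ~((uint64_t)0xFF << (8 * i));
    return x;
}

/* instl: uses bits <7:0> of the shifted mask. */
static uint64_t ins_low16(uint64_t rav, uint64_t rbv, unsigned base_mask)
{
    assert(base_mask == 0x01 || base_mask == 0x03 ||
           base_mask == 0x0F || base_mask == 0xFF);
    unsigned s = (unsigned)(rbv & 7);
    unsigned byte_mask = base_mask << s;          /* up to 16 bits wide */
    uint64_t temp = rav << (8 * s);
    return byte_zap(temp, ~byte_mask & 0xFF);
}

/* insth: uses bits <15:8> of the same shifted mask. */
static uint64_t ins_high16(uint64_t rav, uint64_t rbv, unsigned base_mask)
{
    assert(base_mask == 0x01 || base_mask == 0x03 ||
           base_mask == 0x0F || base_mask == 0xFF);
    unsigned s = (unsigned)(rbv & 7);
    unsigned byte_mask = base_mask << s;
    unsigned byte_loc = (64 - 8 * s) & 63;        /* byte_loc<5:0> */
    uint64_t temp = rav >> byte_loc;
    return byte_zap(temp, ~(byte_mask >> 8) & 0xFF);
}
```

When s is zero the upper eight mask bits are all zero, so every byte of the high
result is zapped even though the right shift count wraps around to zero.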


Mask Byte Instructions 


CASE
    big_endian_data:    Rbv' := Rbv XOR ^b111
    little_endian_data: Rbv' := Rbv
ENDCASE

CASE
    mskbl: byte_mask := ^b0000 0000 0000 0001
    mskwx: byte_mask := ^b0000 0000 0000 0011
    msklx: byte_mask := ^b0000 0000 0000 1111
    mskqx: byte_mask := ^b0000 0000 1111 1111
ENDCASE

byte_mask := LEFT_SHIFT(byte_mask, Rbv'<2:0>)

CASE
    msktl:
        Rc := BYTE_ZAP(Rav, byte_mask<7:0>)
    mskth:
        Rc := BYTE_ZAP(Rav, byte_mask<15:8>)
ENDCASE


Unlike the ext instruction, the msk instruction can be described best using a 
binary mask that is 16 bits wide, not 8 bits wide. Different, but interrelated, masks are 
needed for the high and low variants of the msk instruction. 


Using C for ASCII Input and Output 


For a change of pace after presenting some challenging Alpha byte instructions, and
before we finish this chapter with two additional types of byte instructions, we now
present a more convenient method than the debugger for input and output of ASCII text
strings. This section also foreshadows the material about procedures that predominates
in the next chapter. This preview is in keeping with our general strategy of introducing
certain major topics at an overview level at first and then attending to rules and details
sometime later. The minimal exposition here widens the range of feasible programming
experimentation (see exercises at the end of this chapter). Readers who do not delve
more fully into the calling standards for an operating system will thus have some
exposure to the concept.

Textbook presentations of assembly language programming have sometimes 
implied that routines written in assembly language can be called from high-level lan- 
guages, but—perhaps more by silence than by statement—that routines written in a 
high-level language cannot be, or ought not to be, called from a main program written 
in assembly language. Yet in the OpenVMS environment, and to some extent for the 
Unix environment also, the situation is much more symmetrical. So long as published 
“calling standards” are very meticulously adhered to, a software design team has rather 
full freedom to write each routine in the most convenient or appropriate language. 

Our minimal but nevertheless instructive illustration involves a pair of C external
routines, chrput encapsulating one invocation of putchar that will print a character
on the user’s terminal and chrget encapsulating one invocation of getchar that will
obtain a character from the user’s terminal (Figure 6.6). Even though these routines
process individual characters, we will need to be aware that the operating system
environment handles terminal input and output a line at a time.


/* Encapsulated putchar and getchar routines.
   Note: int corresponds to longword on the Alpha. */

#include <stdio.h>

int chrput(int ch)
{
    return putchar(ch);
}

int chrget()
{
    return getchar();
}


Figure 6.6 Encapsulating the putchar and getchar routines of C


We use encapsulated routines, rather than putchar and getchar themselves, 
because these C routines may or may not—at the implementer’s option—be “macros” 
that insert in-line instructions into a program rather than act as true C callable functions. 
The power of encapsulation, which you may already appreciate from prior study of 
computer science, permits us to hide details which are unduly system-dependent or 
temporarily distracting. 

Before presenting a somewhat realistic program in the next section, we are first going
to present here a more rudimentary program that serves only to test the correctness of this
technique for performing input and output without resorting to the debugger as before.

For both OpenVMS and Unix programming environments, simple called procedures
interact with their callers using register R0 and/or registers R16 through R21.
Also, a quantity typed int in C on the Alpha corresponds to a longword (32 bits) at the
assembly language level. The character code obtained by chrget (getchar) is
returned as a longword in register R0. The character code to be processed by chrput
(putchar) is specified as a longword in register R16.

Calling an external routine from assembly language is somewhat closer to custom- 
ary experience with high-level language programming for OpenVMS than for Unix; 
hence we present the OpenVMS variant first. 


Introducing I/O for OpenVMS 


Figure 6.7a shows the OpenVMS variant of our test program that calls the C routines in
order to validate their operation. We introduce here another MACRO-64 system-supplied
macro, $call, which can be simplified for our present purposes by just using
the name of the external routine to be called and no more than one argument holding the
ASCII value of a character as a longword. The args parameter for $call specifies
arguments and their data type, here longword (/l qualifier).





        .title  IO_C Test character I/O using C (OpenVMS)

NL      =       ^xa                     ; Newline character

        $routine io_c, data_section_pointer=true,-
                kind=stack, saved_regs=<r14,r15>
        $data_section
        $code_section
        mov     r27,r14                 ; Copy linkage section pointer
        .base   r14,$ls                 ; R14 -> linkage section
        ldq     r15,$dp                 ; R15 -> data section
        .base   r15,$ds                 ; Tell MACRO about this
        $call   chrget                  ; Get some input
        $call   chrput, args=<r0/l>     ; Echo the character
        mov     NL,r16                  ; Newline
        $call   chrput, args=<r16/l>    ; Echo the character
done::  mov     1,r0                    ; Tell OpenVMS
        $return                         ; we're ending normally
        $end_routine io_c               ; Needed by $routine
        .end    io_c                    ; Set start address


Figure 6.7a IO_C: a test of input and output (OpenVMS)

By convention each $call changes the contents of register R27. We thus have to
use some other register (R14) instead of register R27 for base addressing down through
the whole routine. Again, by convention, any called procedure is supposed to have
freedom to use certain registers without saving or restoring them. We will later state the
rules completely, but the vulnerable registers include R0 and R1 that we have used in
some of our previous illustrations. Registers R2 through R15 are supposed to be
“respected” by any called routine in the OpenVMS programming environment. We will
tend to use registers R9 through R15 for items to be retained during a procedure call,
since registers R2 through R8 are more volatile in the Unix programming environment.
We should save the registers used in our routine via the saved_regs parameter of the
$routine macro.

We assume here that you have access to the C compiler (cc command) for Open- 
VMS. The file containing the two C routines is compiled to OBJ format, and the 
MACRO-64 main program is assembled to OBJ format. Then the complete program is 
built with the linker. Suitable system commands are as follows: 


    $ cc getput
    $ macro/alpha io_c
    $ link io_c,getput
    $ run io_c


Since the debugger is not needed for I/O, the /debug qualifier can be omitted from
those commands unless we wish to be able to step through the program to monitor its
operation. The assembly listing file (/lis qualifier) and link map file (/map qualifier)
used previously are also optional.

The linker collects the executable portions from the IO_C program and the other
two modules into the program section called $CODE$. The linker also completes the
linkage section $LINK$ with all necessary intermodule pointers for addressing purposes.

The executable file (EXE format) takes its name by default from the first OBJ file
on the link command line. Only the main program should specify a symbolic start
address on its .end statement. Note that neither C routine is named main, in order to
instruct the C compiler to suppress a start address.

If we type some characters, say abc, and then push the return (or enter) key, the 
program should output the first character that we typed on a line by itself. This program 
will “hang” and do nothing until we have indeed pushed return (or enter), because of 
the way that the operating system processes terminal input a line at a time instead of a 
character at a time. No characters are accessible to chrget (getchar) until a whole 
line has been received. Similarly, no output will become visible until a program has sent 
out a “newline” character.


Introducing I/O for Unix


Figure 6.7b shows the Unix variant of our test program that calls the C routines in order 
to validate their operation. We preview here another Alpha instruction, jsr, which 
jumps to a subroutine. The Unix programming environment does not have a system-supplied
facility like $call (OpenVMS only) for managing the arguments used in procedure
calls. Thus we must supply the character code for chrput as a longword value
in register R16 (i.e., using mov). 





/* IO_C    Test character I/O using C (Unix) */

NL      =  "\n"                       # Newline character
STACK   =  1                          # Quadwords needed
FRAME   =  ((STACK*8+8)/16)*16
        .text                         # Section for program code
        .align    4                   # Octaword alignment
        .set      noreorder           # Disallow rearrangements
        .globl    main                # These three lines
        .ent      main                #   mark the mandatory
main:                                 #   'main' program entry
        ldgp      $gp,0($27)          # Load the global pointer
        lda       $sp,-FRAME($sp)     # Allocate stack space
        stq       $26,0($sp)          # Save our own exit address
        .mask     0x04000000,-FRAME   # Saved only register R26
        .frame    $sp,FRAME,$26,0     # Describe the stack frame
        .prologue 1                   # Say that $gp is in use
        .sglobl   first
first:  jsr       $26,chrget          # R0 = character obtained
        ldgp      $gp,0($26)          # Restore global pointer
        mov       $0,$16              # R16 = character
        jsr       $26,chrput          #   to send
        ldgp      $gp,0($26)          # Restore global pointer
        mov       (NL),$16            # R16 = character
        jsr       $26,chrput          #   to send
        ldgp      $gp,0($26)          # Restore global pointer
done:   mov       0,$0                # Signal all is normal
        ldq       $26,($sp)           # Restore exit address
        lda       $sp,FRAME($sp)      # Restore stack level
        ret       $31,($26),1         # Back to Unix environment
        .end      main                # Mark end of procedure


Figure 6.7b IO_C: a test of input and output (Unix)


Further, there is no assembler-maintained table of base registers that the programmer
designates for base addressing, as with the .base directive (OpenVMS only).
Since the global pointer ($gp, or register R29) is used within any called routine, there
has to be an explicit reloading of that register every time control returns to a caller
(main in the IO_C program). Register R26, which is involved in the mechanics of procedure
calls and returns, holds the return address, that is, the location counter address of the
ldgp pseudo-instruction immediately below each jsr instruction. The ldgp uses that address
as its base, and the assembler generates a two-instruction sequence with numerical offsets
as appropriate.
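
The two-instruction sequence behind ldgp is conventionally an ldah followed by an lda. Each adds a sign-extended 16-bit displacement to a base register, and ldah first multiplies its displacement by 65536. The following Python sketch is only a model of that arithmetic, not assembler output; the function names are hypothetical and the displacement is assumed to fit the range reachable by such a pair.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed quantity."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def split_displacement(disp):
    """Split disp into (high, low) so that high*65536 + low == disp."""
    low = sign_extend16(disp)      # what the lda will add
    high = (disp - low) >> 16      # what the ldah will add, times 65536
    return high, low


def ldgp(base, disp):
    """Model ldgp $gp,0($26): base is the return address held in R26."""
    high, low = split_displacement(disp)
    gp = base + high * 65536       # models: ldah $gp,high($26)
    gp = gp + low                  # models: lda  $gp,low($gp)
    return gp
```

Note that when the low half has its top bit set, the high half is one larger than a simple 16-bit shift would give, which compensates for the negative low part.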


In order to conform to the calling convention, the prologue of IO_C differs from 
those of our earlier programs in several respects: 


• Stack space of FRAME bytes must be allocated by decrementing the stack pointer
register SP. FRAME must always be a multiple of 16 bytes (one octaword) for
Unix, and we have devised an assembler expression to guarantee this.


• Register R26 must be saved on the stack (i.e., in memory) at an offset of zero from
the just-modified stack pointer.


• The .mask directive must specify a bit-encoded pattern of register numbers that
have been saved on the stack, followed by the offset in bytes to the region where
those registers are saved.


• The .frame directive must specify the positive size of the stack frame.


Proper operation of the debugger and recovery from errors detected by the system soft- 
ware require these details. Similarly, a somewhat longer exit sequence occurs at the end 
of the program, where register R26 must be restored to the original exit address and the 
stack pointer must be restored by a precise compensating amount. 
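
The arithmetic behind FRAME and the .mask pattern can be checked with a short Python sketch. It assumes that the assembler's "/" operator truncates like integer division, and the function names are ours, not part of any assembler.

```python
def frame_bytes(stack_quadwords):
    """Model FRAME = ((STACK*8+8)/16)*16 with truncating division."""
    return ((stack_quadwords * 8 + 8) // 16) * 16


def save_mask(register_numbers):
    """Build a .mask style bit pattern: bit n is set if register Rn is saved."""
    mask = 0
    for number in register_numbers:
        mask |= 1 << number
    return mask


if __name__ == "__main__":
    for quadwords in range(1, 6):
        print(quadwords, frame_bytes(quadwords))
    print(hex(save_mask([26])))
```

Under that assumption the expression rounds STACK*8 up to the next multiple of 16, and saving only R26 yields the pattern with bit 26 set, as in the listing.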

The files containing the C routines and the assembly language program can be 
processed in a single use of the compiler driver (cc command). Linking will be auto- 
matic if no errors occur during compilation/assembly. Suitable system commands are as 
follows: 


> cc -O0 -o io_c io_c.s getput.c
> io_c


Separate object modules, io_c.o and getput.o, are produced in addition to an executable
output file (assuming no errors). These files can be re-used on a cc command
line when some but not all of the constituent modules comprising a program have to be 
corrected or modified. Since the debugger is not needed for I/O, the -g option can be 
omitted from the cc command unless we wish to be able to step through the program to 
monitor its operation.

The linker collects the data and executable portions from the IO_C program and
the other two modules into a unified address space and assures that all necessary
intermodule pointers are within reach of the global pointer register at run time.

The executable file takes the name specified with the -o option on the cc command
line, or else becomes a.out by default. Only one module going into a composite
program should contain the main designation. Note that neither C routine in
getput.c is designated as main, in order to instruct the C compiler to make them
callable routines instead.

If we type some characters, say abc, and then push the return (or enter) key, the 
program should output the first character that we typed on a line by itself. This program 
will “hang” and do nothing until we have indeed pushed return (or enter), because of 
the way that the operating system processes terminal input a line at a time instead of a 
character at a time. No characters are accessible to chrget (getchar) until a whole 
line has been received. Similarly, no output will become visible until a program has sent 
out a “newline” character. 


BACKWARD: Using Byte Manipulations 


We now develop a somewhat more complex program that combines some of the Alpha 
byte manipulation instructions with the capability just illustrated for obtaining or print- 
ing characters. This program, which will be called BACKWARD (Figure 6.8), obtains 
and stores a line of text. Then it accesses those bytes in such a way that the order of the 
bytes has been completely reversed. Input such as evil will be printed out as live.





/* BACKWARD    Print string backwards (Unix)
   This program prints a prompting phrase, pauses for the user
   to type in a line of text, and prints that text backwards. */

NL      =  "\n"                       # Newline character
LEN     =  96                         # Line length allowed
STACK   =  1                          # Quadwords needed
FRAME   =  ((STACK*8+8)/16)*16
        .data
        .comm     text,LEN            # Storage size (<=255)
        .text                         # Section for program code
        .align    4                   # Octaword alignment
        .set      noreorder           # Disallow rearrangements
        .globl    main                # These three lines
        .ent      main                #   mark the mandatory
main:                                 #   'main' program entry
        ldgp      $gp,0($27)          # Load the global pointer
        lda       $sp,-FRAME($sp)     # Allocate stack space
        stq       $26,0($sp)          # Save our own exit address
        .mask     0x04000000,-FRAME   # Saved only register R26
        .frame    $sp,FRAME,$26,0     # Describe the stack frame
        .prologue 1                   # Say that $gp is in use
        .sglobl   first
first:  lda       $12,text            # R12 -> first byte position
        addq      $12,(LEN-1),$13     # R13 -> last byte location
        mov       $12,$11             # R11 = moving pointer
iloop:  jsr       $26,chrget          # R0 = character obtained
        ldgp      $gp,0($26)          # Restore global pointer
        cmpeq     $0,NL,$1            # If char in R0 is NL,
        blbs      $1,oloop            #   go to output section
        ldq_u     $1,($11)            # Load quadword with the byte
        insbl     $0,$11,$2           # Shift the new byte
        mskbl     $1,$11,$1           # Old bytes with a hole
        bis       $2,$1,$1            # Whole quadword again
        stq_u     $1,($11)            # Store revised data
        lda       $11,1($11)          # Advance the pointer
        cmpule    $11,$13,$1          # If some room remains,
        blbs      $1,iloop            #   go get next character
oloop:  lda       $11,-1($11)         # Decrement the pointer
        cmpult    $11,$12,$1          # If back at the start,
        blbs      $1,line             #   finish up with newline
        ldq_u     $16,($11)           # Load quadword location
        extbl     $16,$11,$16         # Extract desired byte
        jsr       $26,chrput          # Send the character
        ldgp      $gp,0($26)          # Restore global pointer
        br        $31,oloop           # See whether done yet
line:   mov       (NL),$16            # Newline
        jsr       $26,chrput          # Send the newline
        ldgp      $gp,0($26)          # Restore global pointer
done:   mov       0,$0                # Signal all is normal
        ldq       $26,($sp)           # Restore exit address
        lda       $sp,FRAME($sp)      # Restore stack level
        ret       $31,($26),1         # Back to Unix environment
        .end      main                # Mark end of procedure


Figure 6.8 BACKWARD: An illustration of accessing bytes


This example shows rather well why the word for computer in the French lan- 
guage is ordinateur, that is, a device with the capability for arranging information not 
just for performing numerical calculations. 

We should draw attention to a few other aspects of this program that might other- 
wise slip too easily past a reader’s attention. First, we made a decision to control both 
program loops using address limit comparisons (cmpu instructions). The output loop
might have been implemented using a down-counter since the total number of characters
would then be calculable (how?). Nevertheless, address comparisons reuse the
same preserved boundary addresses as in the input loop and make the program seem
more internally consistent in approach.

Second, we have not allocated registers haphazardly. We used registers R0
through R2, which may be volatile, in several short-term situations. Conversely, we 
used registers R11 through R13 (Unix) or R11 through R15 (OpenVMS) for long-term 
address pointers that would be needed before and after calling the external input/output 
routines. By the calling conventions, any called routines must never perturb the contents 
of those particular registers. We isolated directly in register R16 the byte value for the 
character to be printed. 

Third, since no computer ever has enough registers to satisfy every programmer’s 
need, registers have to be reused. Notice that register R1 is used for three different 
short-term purposes on each pass down through the input loop, for instance. 

The algorithm in this program is quite simple. In the input loop, register R11 is 
stepped toward increasing address values. In the output loop, register R11 is stepped 
toward decreasing address values. The “standard sequences” given earlier in this chap- 
ter for accessing bytes in memory storage have been streamlined to the maximal extent 
possible for this present application. 
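
For readers who would like to trace the algorithm away from the machine, here is a Python sketch of the same two loops. It is a model under stated assumptions rather than a translation of the program: memory is a little-endian bytearray whose length is a multiple of 8, the helper functions imitate only the byte cases of the instructions they are named after, and the function backward is our own name.

```python
MASK64 = (1 << 64) - 1


def ldq_u(mem, addr):
    """Load the aligned quadword that contains byte address addr."""
    base = addr & ~7
    return int.from_bytes(mem[base:base + 8], "little")


def stq_u(mem, addr, quad):
    """Store quad into the aligned quadword that contains addr."""
    base = addr & ~7
    mem[base:base + 8] = quad.to_bytes(8, "little")


def insbl(value, addr):
    """Shift the low byte of value into the byte lane selected by addr."""
    return ((value & 0xFF) << (8 * (addr & 7))) & MASK64


def mskbl(quad, addr):
    """Clear the byte lane selected by addr, leaving a 'hole'."""
    return quad & ~(0xFF << (8 * (addr & 7))) & MASK64


def extbl(quad, addr):
    """Extract the byte in the lane selected by addr."""
    return (quad >> (8 * (addr & 7))) & 0xFF


def backward(line, length=96):
    mem = bytearray(length)
    first, last = 0, length - 1            # roles of R12 and R13
    pointer = first                        # role of R11
    for code in line.encode("ascii"):      # input loop
        if code == ord("\n"):
            break
        quad = ldq_u(mem, pointer)
        stq_u(mem, pointer, insbl(code, pointer) | mskbl(quad, pointer))
        pointer += 1
        if pointer > last:                 # no room remains
            break
    output = []
    while True:                            # output loop
        pointer -= 1
        if pointer < first:
            break
        output.append(chr(extbl(ldq_u(mem, pointer), pointer)))
    return "".join(output)


if __name__ == "__main__":
    print(backward("evil\n"))
```

The pointer is stepped upward in the first loop and downward in the second, exactly as register R11 is in Figure 6.8.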

Figure 6.8 shows only the Unix variant of BACKWARD. No significant differences
are required for an OpenVMS variant (included on the CD-ROM accompanying
this book) in the region from first to done, apart from the different style of register 
names and manner of calling external support routines. 

People have long had a certain fascination with optical mirroring (for example, the 
notebooks of Leonardo da Vinci) and with letter and word reversals. Test examples for 
BACKWARD could include:


123456789 

Erewhon (Samuel Butler, 1872) 
Emit no star spin mood 
Sore spots lived on 


Perhaps you can devise additional illustrations as a pastime. 


Compare Byte Instruction 


In arithmetic, the 64-bit width of an Alpha computer improves upon a 32-bit VAX or 
16-bit PDP-11 only in extending the range of representable numbers, not in magnifying 
the number of computations performed per instruction. With data represented as
independent bytes, however, an Alpha does have a few instructions that can perform up to 8
similar operations in parallel.

The unique but versatile compare byte instruction shares opcode 10 (Table 6.4) 
with the integer addition, subtraction, and comparison instructions. Although similar in 
form to those other instruction types, the cmpbge instruction uses the second operand 
quite differently: 


cmpbge Ra,Rb,Rc ; Rc<63:8>=0; Rc<7:0> result
cmpbge Ra,lit,Rc ; Rc<63:8>=0; Rc<7:0> result 


where a bit is set in the lowest byte of register Rc if a corresponding byte of the value in 
register Ra is greater than or equal to the equivalently located byte of the value in register 
Rb or the literal, in an unsigned comparison. The high 56 bits of register Rc are set to 
zero. Bit 0 of register Rc corresponds to byte 0 of registers Ra and Rb, bit 1 to byte 1, etc. 
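
The definition just given can be restated as a small Python model. This is a sketch of the semantics described above, with register values held as non-negative integers below 2**64; the function name simply echoes the mnemonic.

```python
def cmpbge(ra, rb):
    """Return the 8-bit result of comparing the 8 bytes of ra and rb.

    Bit i of the result is 1 when byte i of ra is greater than or equal to
    byte i of rb, both taken as unsigned values. All higher bits are zero.
    """
    result = 0
    for i in range(8):
        a = (ra >> (8 * i)) & 0xFF
        b = (rb >> (8 * i)) & 0xFF
        if a >= b:
            result |= 1 << i
    return result


if __name__ == "__main__":
    print(bin(cmpbge(0x00FF, 0x0100)))
```

In the sample call, byte 0 compares as FF against 00 and sets bit 0, while byte 1 compares as 00 against 01 and leaves bit 1 clear; the six upper bytes are equal and set their bits.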


Table 6.4 Alpha Miscellaneous Byte Instructions 


                      Function
Mnemonic    Opcode    Code        Purpose
cmpbge      10        0F          Compare 8 unsigned bytes
zap         12        30          Zero selected bytes
zapnot      12        31          Zero deselected bytes


Eight simultaneous one-byte comparisons are recorded in register Rc. The result 
in the low byte of register Rc has potential as input to the zap and zapnot instructions 
that are introduced below. 

Some applications of this instruction have been suggested in the architectural ref- 
erence handbooks. The following sequence scans for a byte of zeros (i.e., a NUL char- 
acter) in a character string. 


        [set R1 -> aligned quadword address of the string]


loop:   ldq     R2,0(R1)        ; R2 = 8-byte string
        lda     R1,8(R1)        ; Advance the pointer
        ; Some sort of conditional loop exit is needed here
        cmpbge  R31,R2,R3       ; R3 = 0 if NO bytes of zero
        beq     R3,loop         ; Do the next 8
        ; Now use R3 to determine which byte(s) held a zero
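
The scan for a NUL byte can be modeled in Python as follows. The sketch assumes a little-endian buffer whose length is a multiple of 8 and an aligned starting index; the bound of the range supplies the conditional loop exit that the assembly sequence leaves open. The name find_nul is hypothetical.

```python
def cmpbge(ra, rb):
    """Bit i is 1 when byte i of ra >= byte i of rb (unsigned)."""
    result = 0
    for i in range(8):
        if ((ra >> (8 * i)) & 0xFF) >= ((rb >> (8 * i)) & 0xFF):
            result |= 1 << i
    return result


def find_nul(memory, start=0):
    """Return the index of the first zero byte at or after start, or -1."""
    for base in range(start, len(memory), 8):
        quad = int.from_bytes(memory[base:base + 8], "little")
        flags = cmpbge(0, quad)            # R31 always reads as zero
        if flags:                          # some byte of quad is zero
            lowest = (flags & -flags).bit_length() - 1
            return base + lowest
    return -1


if __name__ == "__main__":
    print(find_nul(b"ABCDEFGHIJK\0\0\0\0\0"))
```

Because zero is greater than or equal to a byte only when that byte is itself zero, the lowest set bit of the result identifies the terminating byte.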


The following sequence checks a string of characters in register R1 to ensure that every
character is a numeral from 0 through 9:

        $data_section
LIT0s:  .ascii  "////////"
LIT9s:  .ascii  "::::::::"
        $code_section
        ldq     R2,LIT0s        ; 8 copies of char just below "0"
        ldq     R3,LIT9s        ; 8 copies of char just above "9"
        cmpbge  R2,R1,R4        ; Some R4<7:0>=1 if char LT "0"
        cmpbge  R1,R3,R5        ; Some R5<7:0>=1 if char GT "9"
        bne     R4,error        ; Branch if some character too low
        bne     R5,error        ; Branch if some character too high
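
The range check can be tried out in Python. This sketch relies on the ASCII facts that "/" is the character just below "0" and ":" is the character just above "9"; the function name all_digits is our own.

```python
def cmpbge(ra, rb):
    """Bit i is 1 when byte i of ra >= byte i of rb (unsigned)."""
    result = 0
    for i in range(8):
        if ((ra >> (8 * i)) & 0xFF) >= ((rb >> (8 * i)) & 0xFF):
            result |= 1 << i
    return result


BELOW_ZERO = int.from_bytes(b"/" * 8, "little")   # 8 copies of "/"
ABOVE_NINE = int.from_bytes(b":" * 8, "little")   # 8 copies of ":"


def all_digits(r1):
    """True when each of the 8 bytes in r1 is an ASCII numeral."""
    too_low = cmpbge(BELOW_ZERO, r1)    # bit set where "/" >= character
    too_high = cmpbge(r1, ABOVE_NINE)   # bit set where character >= ":"
    return too_low == 0 and too_high == 0


if __name__ == "__main__":
    print(all_digits(int.from_bytes(b"12345678", "little")))
    print(all_digits(int.from_bytes(b"1234x678", "little")))
```

Two comparisons thus validate eight characters at once, with no loop over the individual bytes.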


The following sequence compares two strings for greater/less: 


        [set R1 -> aligned quadword address of string 1]
        [set R2 -> aligned quadword address of string 2]


loop:   ldq     R3,0(R1)        ; R3 = 8 bytes from string 1
        lda     R1,8(R1)        ; Advance pointer for string 1
        ldq     R4,0(R2)        ; R4 = 8 bytes from string 2
        lda     R2,8(R2)        ; Advance pointer for string 2
        ; Some sort of conditional loop exit is needed here
        xor     R3,R4,R5        ; Are all 8 bytes equal?
        beq     R5,loop         ; Yes - now check next 8
        cmpbge  R31,R5,R5       ; No - flag the equal byte positions


Now register R5 can be used to determine the first not-equal byte position. Because R31
supplies zero bytes, cmpbge sets a bit in R5 wherever the exclusive-or result is zero, so the
lowest clear bit of R5<7:0> marks the first position at which the two strings differ.
Comparing the two bytes at that position then settles greater versus less. The SORTSTR
program (Figure 10.4) illustrates one application of the cmpbge instruction.
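
One step of such a comparison, for a single pair of quadwords, might be modeled in Python as follows. The final byte-by-byte decision is our own completion of the sequence, which the assembly fragment leaves to the reader, and the name compare_quads is hypothetical.

```python
def cmpbge(ra, rb):
    """Bit i is 1 when byte i of ra >= byte i of rb (unsigned)."""
    result = 0
    for i in range(8):
        if ((ra >> (8 * i)) & 0xFF) >= ((rb >> (8 * i)) & 0xFF):
            result |= 1 << i
    return result


def compare_quads(q1, q2):
    """Return -1, 0, or 1 for two 8-byte groups, compared from byte 0 up."""
    difference = q1 ^ q2                    # xor R3,R4,R5
    if difference == 0:
        return 0                            # all 8 bytes equal
    equal = cmpbge(0, difference)           # bit set where the bytes agree
    unequal = ~equal & 0xFF
    position = (unequal & -unequal).bit_length() - 1
    a = (q1 >> (8 * position)) & 0xFF
    b = (q2 >> (8 * position)) & 0xFF
    return 1 if a > b else -1


if __name__ == "__main__":
    quad = lambda text: int.from_bytes(text, "little")
    print(compare_quads(quad(b"ABCDEFGH"), quad(b"ABCXEFGA")))
```

Byte 0 is the first character of each group on a little-endian machine, so the lowest differing position is the one that decides the ordering.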


Zero Bytes Instruction 


The last group of byte manipulation instructions includes two zero byte instructions that 
are complements of each other: 


zap     Ra,Rb,Rc    ; Rc = byte_zap(Ra,Rb<7:0> 1s)
zap     Ra,lit,Rc   ; Rc = byte_zap(Ra,lit 1s)
zapnot  Ra,Rb,Rc    ; Rc = byte_zap(Ra,Rb<7:0> 0s)
zapnot  Ra,lit,Rc   ; Rc = byte_zap(Ra,lit 0s)


where bits <7:0> of the value in register Rb or the literal encode a pattern of 1 and 0 bits
that governs the manner of data transfer from source register Ra to destination register
Rc. For zap, bytes <7:0> of the value in register Ra are zeroed according to ones in
bit positions <7:0> of the second operand; the other bytes are copied. For zapnot, bytes
<7:0> of the value in register Ra are zeroed according to zeros in bit positions <7:0>
of the second operand; the other bytes are copied.
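
A Python model makes the complementary behavior easy to experiment with. It is a sketch of the semantics described above, with register values as non-negative integers below 2**64; byte_mask is a hypothetical helper that expands the eight selector bits into a 64-bit mask.

```python
MASK64 = (1 << 64) - 1


def byte_mask(selector):
    """Expand 8 selector bits into a mask with FF in each selected byte."""
    mask = 0
    for i in range(8):
        if selector & (1 << i):
            mask |= 0xFF << (8 * i)
    return mask


def zap(ra, rb):
    """Zero the bytes of ra selected by ones in rb<7:0>; copy the others."""
    return ra & ~byte_mask(rb & 0xFF) & MASK64


def zapnot(ra, rb):
    """Zero the bytes of ra selected by zeros in rb<7:0>; copy the others."""
    return ra & byte_mask(rb & 0xFF)


if __name__ == "__main__":
    print(hex(zap(0xFEDCBA9876543210, 0x0F)))
    print(hex(zapnot(0xFEDCBA9876543210, 0x0F)))
```

With a selector of 0F, zap clears the low longword and zapnot keeps only the low longword, so the two results together reassemble the original value.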

These instructions share opcode 12 (Table 6.4) with the byte manipulation instruc- 
tions. Bits <63:8> of the value in register Rb are ignored. When we wish to zero com- 
plete bytes rather than arbitrary patterns of bits, the zap and zapnot instructions have 
an advantage over and and bic instructions. With zap and zapnot, the literal form 
can be used to span across an entire quadword. With and and bic, a full 64-bit mask 
would first have to be loaded from memory into register Rb if we want to accomplish 
zeroing of bits within bytes 1 through 7 because an 8-bit mask in a literal could only 
specify zeroing of bits within byte 0. 

The zap and zapnot instructions have potential applications in working with 
data structures where byte-length information units are packed into aligned quadwords 
in memory. Another possible use would be in performing a conditional isolation of the 
lowest byte using the one-bit result from a previous cmp instruction as the second oper- 
and code, or perhaps the isolation of a pattern of bytes using the eight-bit result from a 
previous cmpbge instruction. 
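
Both suggested uses can be sketched in Python. The models of cmpbge and zapnot follow the definitions given in this chapter; the two wrapper functions and their names are purely illustrative.

```python
def cmpbge(ra, rb):
    """Bit i is 1 when byte i of ra >= byte i of rb (unsigned)."""
    result = 0
    for i in range(8):
        if ((ra >> (8 * i)) & 0xFF) >= ((rb >> (8 * i)) & 0xFF):
            result |= 1 << i
    return result


def zapnot(ra, rb):
    """Keep the bytes of ra selected by ones in rb<7:0>; zero the others."""
    mask = 0
    for i in range(8):
        if rb & (1 << i):
            mask |= 0xFF << (8 * i)
    return ra & mask


def low_byte_if(ra, condition):
    """Isolate byte 0 of ra when the one-bit condition is 1, else give 0."""
    return zapnot(ra, condition & 1)


def keep_bytes_at_least(ra, thresholds):
    """Keep each byte of ra that is >= the matching byte of thresholds."""
    return zapnot(ra, cmpbge(ra, thresholds))


if __name__ == "__main__":
    text = int.from_bytes(b"a1b2c3d4", "little")
    floor = int.from_bytes(b"a" * 8, "little")
    print(keep_bytes_at_least(text, floor).to_bytes(8, "little"))
```

In the sample call the numerals fall below the threshold character and are zeroed, while the four letters survive in their original byte positions.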


Summary 


The Alpha architecture as originally defined does not include individual instructions to 
load or store information units smaller than longwords. The instructions in this chapter 
can be used in rather compact sequences, in conjunction with the unaligned quadword 
load and store instructions, in order to isolate and work with these smaller information 
units. Nevertheless, the best performance will always result from algorithms that instead 
work with groups of eight bytes that are aligned to quadword addressing boundaries. 

The byte-manipulation instructions resemble the logical functions and shift 
instructions, but differ by using a byte count from 0 to 7 instead of a bit count from 0 to 
63. Moreover, the byte-manipulation instructions generally zero some bytes of the 
result based on the size of the information unit specified through the function code field 
of the instruction. 

We have also shown in this chapter how to harness the convenience of the single 
character input and output functions of the C language for the simple but adequate task 
of obtaining a line of input and printing a line of output from a region of memory starting
at an aligned quadword boundary. We have used (admittedly without full explanation
yet) the $call macro (OpenVMS) or the jsr instruction (Unix), both of which
will receive more attention in the next chapter.

In Chapters 1 through 6 we have discussed all of the Alpha integer-related instructions.
We have presented a basic selection of programming techniques, including a
workable method for full-line input and output. The reader who has come with us this
far has attained a minimal toolset for writing any non-proceduralized program involving
character-string or integer data on the Alpha.


EXERCISES


6.1 Suggest why the Alpha does not have an extbh instruction. 


6.2 What would be the shift amount used at execution time by exttl instructions if the
second operand Z(R11) has R11 = 30033 (hexadecimal) and Z = 14 (decimal)?


6.3 Verify that the “intended sequence” of instructions to load a quadword from an 
unaligned address will work when the effective address modulo 8 is two. 


6.4 Work out homologues for the set of cases for the value 3 for the shift parameter that 
would correspond to: 


a. Figure 6.1 (ext instructions) 
b. Figure 6.3 (ins instructions) 
c. Figure 6.4 (msk instructions) 


6.5 Verify that the “intended sequence” of instructions to store a quadword at an 
unaligned address will work when the effective address modulo 8 is two. 


6.6 The Alpha does not have a cmpge integer signed compare instruction. Explain 
whether the cmpbge instruction using a literal fills part of that void. 


6.7 Explain why the output of the IO_C program produces one more line of output, if 
you enter no characters before pushing the return (or enter) key, than when you do 
supply one or more other characters first. 


6.8 The standard for the C language says that putchar “returns the character written” 
(unless a system-detected error occurs). Use the debugger to verify this for chrput. 
The sensible places to look are registers R0 and R16.


6.9 Use the debugger to find out what registers are involved with (a) $call sequences
in the IO_C program for OpenVMS and/or (b) jsr instructions in the IO_C program
for Unix.


6.10 Describe the different uses of register R1 in the BACKWARD program. 


6.11 Adapt BACKWARD to edit a sentence by inserting the word “not” at a predeter- 
mined location. 


6.12 Combine and adapt relevant portions from SCANTEXT, DECNUM, IO_C, and 
BACKWARD to build a program that prints on the screen the number of words found 
in a line of input from the keyboard. 


6.13 Let registers Ra and Rb contain, respectively, the full quadwords 
FEDCBA9876543210 and 0123456789ABCDEF. What would be the value in regis- 
ter Rc after execution of each of the following instructions? 


a. extty (your instructor may or may not expect you to do all 7 cases)
b. insty (your instructor may or may not expect you to do all 7 cases)
c. mskty (your instructor may or may not expect you to do all 7 cases) 
d. cmpbge 

e. zap 

f. zapnot 


6.14 Write a program that prompts for a decimal number with arbitrarily many digits and 
formats it using commas to set off the digits in groups of three, i.e., according to the 
standard pattern x,xxx,xxx. You will need to determine the length of the input string 
of digits. Test all the cases (length modulo 3). 


6.15 Puzzle solvers know about palindromes, or words that read the same backwards and 
forwards (RADAR, OTTO). 


a. Write a routine to test whether a word is a palindrome. As you plan the loop in 
which some sort of comparison instruction occurs, think carefully about an appro- 
priate exit condition. Consider carefully whether the two cases of even and odd 
numbers of letters require special care. Test thoroughly. 


b. (More difficult) Write a routine to test whether a sentence in mixed upper/lower 
case that contains appropriate punctuation for reading in the forward direction is 
palindromic. Punctuation characters and spaces are to be ignored. Some palindro- 
mic sentences are as follows: 


Madam, I’m Adam. 

A man, a plan, a canal: Panama! 

He won now, eh? 

Doc, note, I dissent. A fast never prevents a fatness. I diet on cod.

Was it a bar or a bat I saw?
Norma is as selfless as I am, Ron. 
Able was I ere I saw Elba. 

Poor Dan is in a droop. 
Sununu’s tonsil is not Sununu’s. 
Sonar possesses sopranos. 

Tell a body—do ballet. 

Never ever ever even. 

A Toyota. 

Won ton? Not now. 

Zeus was I ere I saw Suez. 


c. (Much more difficult) Write a routine to test whether a sentence in mixed upper/
lower case with appropriate punctuation for reading in the forward direction is 
palindromic by whole words, not by characters. Punctuation characters are to be 
ignored. Some word-palindromic sentences are as follows: 


All for one and one for all. 

In order to stop hunger, stop to order in. 

Fall leaves after leaves fall. 

First ladies rule the state and state the rule, “Ladies first.” 


6.16 The EDT text editor supplied with OpenVMS has a Change Case command that will 
modify a highlighted segment of text by converting A-Z to a-z and, simultaneously, 
a-z to A-Z while leaving all other ASCII characters unchanged. Write a routine 
incorporating some bis, bic, or xor instructions that will emulate this EDT com- 
mand. Test its operation on something of your own invention and on the following: 


.asciz /"'Why, I don't know Bertha,' Bert said."/ 
Spaces and punctuation marks must not have been altered. 


6.17 (Strongly recommended) Incorporate all of the additional Alpha instructions from 
this chapter into your personal summary chart(s). 





CHAPTER 7 


Subroutines and 
Procedures 


Chapters 1 through 6 have discussed many fundamental
aspects of computer architecture with illustrations relating to the integer load
and store instructions, the integer operate instructions, and the related loop control 
instructions of the Alpha architecture. Except for the logical functions, most of the inte- 
ger operations have floating-point counterparts in RISC computers in general and in the 
Alpha in particular, which we shall defer until Chapter 8 rather than take them up now. 
Instead, we turn next to the extraordinarily important topic of subroutines and proce- 
dures, for these segmentation capabilities make possible the division of large program- 
ming efforts through teamwork, the re-use of previously developed and thoroughly 
tested routines, and standardized interfacing to important and powerful vendor-supplied 
libraries of support routines. 


Instructions for invoking subroutines and procedures may be likened to a boomer- 
ang, while an unconditional GOTO instruction might be likened to a ballistic projectile. 
Just as a boomerang is supposed to return to the vicinity from which it was hurled, a 
subroutine usually returns to the instruction immediately following in line after the 
“call” or “jump to subroutine” instruction. Since subroutine calls can be nested—even 
recursively—some provision for remembering a sequence of return addresses is needed. 
Some form of last-in first-out (LIFO) stack in memory provides the most general 
method, because register sets are always limited. 


We will thus expand the discussion of stacks begun in Chapter 4. Then we will 
discuss and illustrate simple subroutines of the sort that can be commonly incorporated 
within the same source file as the main routine that calls them.

The major portion of this chapter explains the more general forms of subroutines,
which are called procedures and functions in some programming environments including
Unix and OpenVMS. This body of material includes the conventions for register
ing Unix and OpenVMS. This body of material includes the conventions for register 
usage, the methods for passing arguments (i.e., data and parameters) between calling and 
called routines, and examples of support routines supplied with an operating system. 


Stacks 


Operating systems establish at least one stack for general support of required capabili- 
ties such as input and output. Often, instruction sets and programming environments 
permit a programmer to establish other stacks. 


Stack Addressing 


Many CISC architectures support autoindexed addressing modes that make possible a 
very easy and fully general method for “pushing” and “popping” individual items onto 
and then from a stack. Many assemblers designate either a purpose-built register or one 
of the general-purpose integer registers as the SP register, the stack pointer. 

Two conventions are possible for maintaining a stack using the stack pointer. In 
one of them, autoincrementing of SP is associated with pushing and autodecrementing 
of SP with popping. With this convention, the stack origin is at a low address, and the 
stack grows in the direction of increasing memory addresses and shrinks back in the 
direction of decreasing memory addresses. This resembles the way we may manage a 
last-in first-out (LIFO) stack of paperwork on a desktop. Digital Equipment Corpora- 
tion’s operating system environments have historically followed the opposite conven- 
tion. Autodecrementing of SP is associated with pushing and autoincrementing of SP 
with popping. With this convention, the stack origin starts at some preset address when 
the program is loaded for execution, and the stack grows in the direction of decreasing 
memory addresses and shrinks back in the direction of increasing memory addresses 
when stack storage cells are relinquished.

The use of the addressing modes —(SP) to push and (SP)+ to pop items from the 
stack on a PDP-11 always used a full 16-bit word (2 bytes) for each item, irrespective of 
the actual size of the information unit (byte or word). These same modes, —(SP) to push 
and (SP)+ to pop, work on a VAX in concert with information implied by variants of the 
opcode to specify the number of bytes of stack space needed for the information unit, 
e.g., movt (t=b,w,l,q for 1,2,4,8 bytes). Since such addressing modes facilitate both 
data movement and pointer adjustment within a single instruction, we need not necessarily
see a large adjustment being made to the address pointer value in the register SP.

In a load/store architecture, on the other hand, pointer adjustment and data movement
require separate instructions, and there is an “economy of scale” if register SP is
modified so as to reserve or relinquish several stack cells all at once. The Alpha uses 
only quadword accesses for the stack. If an Alpha routine needs three quadwords of 
local storage, for example, the following scheme might be appropriate: 


        lda     SP,-3*8(SP)     ; Reserve 3 quadwords

        stq     Rx,2*8(SP)      ; Most deeply buried item
        stq     Ry,1*8(SP)      ; Next-most deeply buried
        ldq     Rq,2*8(SP)      ; Get copy of original Rx
        lda     SP,3*8(SP)      ; Pop 3 quadwords


Here the individual multipliers could be defined as symbols (e.g., X=2, Y=1, Z=0)
appropriate to the significance of each local data element. (See caution about avoiding 
an odd number of quadwords on the stack in the Unix subsection below.) 


Within any logically distinct portion of code between two lda instructions that
modify the address value in register SP, only the stack quadwords located within the
range implied by such a matching pair of lda instructions are logically accessible (see
Figure 7.1). Values at higher addresses are out of scope; they do not “belong” to this 
portion of code, but may constitute valuable data belonging to a calling routine. Values 
at lower addresses are unknown and are also out of scope. 
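
The three-quadword scheme can be traced with a Python sketch in which memory is a dictionary keyed by byte address. The starting stack address and the register contents are arbitrary values chosen for illustration, and the function name is ours.

```python
def three_quadword_frame(sp, rx, ry):
    """Model reserving 3 quadwords, storing two items, and releasing them."""
    memory = {}
    sp -= 3 * 8                     # lda SP,-3*8(SP)   reserve 3 quadwords
    lowest = sp                     # lowest address that is now in scope
    memory[sp + 2 * 8] = rx         # stq Rx,2*8(SP)    most deeply buried
    memory[sp + 1 * 8] = ry         # stq Ry,1*8(SP)    next-most deeply buried
    rq = memory[sp + 2 * 8]         # ldq Rq,2*8(SP)    copy of original Rx
    sp += 3 * 8                     # lda SP,3*8(SP)    pop 3 quadwords
    return sp, rq, lowest


if __name__ == "__main__":
    final_sp, rq, lowest = three_quadword_frame(0x1000, 111, 222)
    print(hex(final_sp), rq, hex(lowest))
```

Only the addresses from the lowest value up to, but not including, the original SP are in scope between the two lda adjustments, and SP returns to its original value afterwards.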


Figure 7.1 Addressing quadword values stored on the stack

        High addresses
              :
          2*8(SP)
          1*8(SP)
          0*8(SP)
              :
        Low addresses


In some environments, including the PDP-11 and the VAX but not the Alpha, the 
stack is used to hold the return address of the caller when a subroutine linkage is 
invoked.


Stack Conventions (Unix)


The vendor documentation for the Digital Unix programming environment specifies a 
calling convention in which the stack pointer (SP) must always have an octaword align- 
ment (lowest 4 bits of SP should be zero). As a corollary, just one individual quadword 
of stack space (or any other odd multiple like 3 as sketched out just above) should not be 
claimed. In addition, the Unix calling convention entails modifying SP only as part of the 
prologue of a procedure and just once more as part of the exit sequence of a procedure. 
The Unix system software environment consistently uses Alpha register R30 as the 
principal stack pointer. The Unix assembler recognizes $sp as a functional synonym for 
the less informative $30. When a program is loaded to run, the stack pointer $sp is ini- 
tialized to point to a large allotment of virtual memory addresses for the user’s process. 


Stack Conventions (OpenVMS) 


Sometimes the number of stack quadwords cannot be specified at assembly time; in 
such cases, an Alpha program would use two-instruction sequences for pushing and
popping:


        lda     SP,-8(SP)       ; Claim one quadword
        stq     Rx,0(SP)        ;   and push contents of Rx

        ldq     Rx,0(SP)        ; Pop data into Rx
        lda     SP,8(SP)        ;   and release one quadword


Perhaps an item to be popped from the stack is no longer needed at all. While it could 
be “loaded” into register R31, that would be a superfluous instruction; the lda adjustment
of register SP would be sufficient. The stack is maintained with quadword alignment
at all times in the OpenVMS programming environment. That is, the lowest 3 bits
of SP should be zero. 

The OpenVMS system software environment consistently uses Alpha register R30 
as the principal stack pointer. The MACRO-64 assembler recognizes SP as a functional
synonym for the less informative R30. When a program is loaded to run, the stack 
pointer SP is initialized to point to a large allotment of virtual memory addresses for the 
user’s process. 


User-Defined Stacks 


Because of the very restrictive conventions for stack usage in the Unix programming 
environment, we are going to present some examples of programming where SP is
manipulated only for purposes related to procedure calling, and where another register
(typically R9) is used as a stack pointer for algorithmic purposes. In general, a program- 
mer or a compiler could establish several stacks using additional registers as stack 
pointers. 

In order to emphasize when a register is an address pointer, we will use lda
instructions for pointer adjustment (unless the adjustment is larger than a signed 16-bit
integer amount):


        lda     R9,-8(R9)       ; Claim one quadword
        stq     Rx,0(R9)        ;   and push contents of Rx

        ldq     Rx,0(R9)        ; Pop data into Rx
        lda     R9,8(R9)        ;   and release one quadword


We also emphasize that no branching may ever occur into a local code block like this
which is bounded by paired lda instructions that manage a stack.


DECNUM2: Stack Usage in an Algorithm 


Let us revisit the DECNUM program from Chapter 5, generalizing it to be able to for-
mat decimal numbers having more than eight digits. We establish a programmer-
defined stack, where we can use quadword storage units one at a time as each digit,
from right to left, is first computed as a remainder and is then converted to an ASCII
character. Afterwards, these stacked items are transferred using the byte-move tech-
nique explained in Chapter 6 into the left-to-right ordering that we would need for print-
ing. These ideas are implemented in the DECNUM2 program shown in Figure 7.2.


/* DECNUM2      Convert integer to ASCII (Unix) */

/* This program converts a positive number from N2
   into a string of ASCII-encoded decimal digits at A2. */

LEN     =       20                      # Allowance for 20 digits
        .data
        .comm   STACK,8*LEN             # Space for user-defined stack
DOT8:   .quad   0xcccccccccccccccd      # 0.8 (finite approx.)
N2:     .quad   0x12345678              # Number to convert
        .comm   A2,80                   # Space for result
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #   mark the mandatory
main:                                   #   'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        .frame  $sp,0,$26,0             # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
new:    lda     $9,STACK+(8*LEN)        # R9 -> user stack
        ldq     $2,DOT8                 # Get reciprocal factor
        ldq     $1,N2                   # Get number to convert
        lda     $9,-8($9)               # Reserve a quadword
        stq     $31,0($9)               #   and store 0 as a flag
divl:   umulh   $1,$2,$3                # R3 = high 64 bits
        srl     $3,3,$3                 # R3 = quotient = R1/10
        mulq    $3,10,$4                # R4 = 10*(R1/10)
        subq    $1,$4,$4                # R4 = remainder now
        bis     $4,0x30,$4              # Make into ASCII char
        lda     $9,-8($9)               # Reserve a quadword
        stq     $4,0($9)                #   and store character
        beq     $3,store                # Done if 0 quotient
        mov     $3,$1                   # Move quotient to R1
        br      $31,divl                #   and go divide again
store:  lda     $1,A2-1                 # R1 = pre-indexed pointer
20:     lda     $1,1($1)                # R1 -> where to store char
        ldq     $2,0($9)                # R2 = character
        lda     $9,8($9)                #   popped from user stack
        beq     $2,done                 # Done if end flag seen
        ldq_u   $3,($1)                 # Access quadword,
        insbl   $2,$1,$2                #   insert new byte, and
        mskbl   $3,$1,$4                #   retain other seven
        bis     $2,$4,$3                # Merge and
        stq_u   $3,($1)                 #   store this result
        br      $31,20b                 # See if more digits
done:   mov     0,$0                    # Signal all is normal
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 7.2 DECNUM2: An illustration of stack usage


We have changed some of the register assignments in making edits from DECNUM
to DECNUM2 and have made certain other design choices with a view toward another
transformation of the heart of this program into a callable subroutine, a little further ahead
in this chapter.

Since the number of digits is indefinite, and indeed cannot be predicted in any 
simple way from the magnitude of the number being converted, we need some method 
for marking the number of digits generated and pushed onto the stack. Several possibil- 
ities could be considered. First, we could use some sort of counter. Second, we might 
preserve the value of the stack pointer register R9 before the stack-building loop is 
entered. With that choice, the stack-relinquishing loop would be controlled by some 
sort of address comparison, requiring one register to hold the original pointer value, a 
cmpu instruction, and another register to hold the result from the comparison to be 
tested in a branch instruction. Third, we might be able to think of some special flag 
value that would never be an actual valid data entry stored on the stack during the oper- 
ation of the algorithm. In this instance, we will store a binary zero (the NUL character 
code). This latter method, when it can be used, usually requires fewer registers and/or 
fewer instructions for loop control. 

We can use the debugger to verify the correct functioning of this program, or we 
could extend it to use the input/output routines of the C language to display the format- 
ted number (as we suggest in one of the exercises). The expected answer is 305419896. 
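
For readers who would like to check the arithmetic away from an Alpha system, the algorithm of DECNUM2 can be mimicked in Python. This is a sketch under stated assumptions, not a translation of the assembly listing: the function names are our own, a Python list plays the part of the user-defined stack, and umulh is modeled with Python's unbounded integers.

```python
MASK64 = (1 << 64) - 1
DOT8 = 0xCCCCCCCCCCCCCCCD        # 0.8 scaled by 2**64, rounded up

def umulh(a, b):
    """High 64 bits of the 128-bit unsigned product, like the umulh instruction."""
    return ((a & MASK64) * (b & MASK64)) >> 64

def decnum2(n):
    """Convert an unsigned 64-bit integer to its decimal ASCII string."""
    stack = [0]                          # binary zero serves as the end flag
    while True:
        quotient = umulh(n, DOT8) >> 3   # n * 0.8 / 8 = n / 10
        remainder = n - 10 * quotient
        stack.append(remainder | 0x30)   # make into an ASCII character
        if quotient == 0:
            break
        n = quotient
    out = bytearray()
    while True:
        ch = stack.pop()                 # digits come off in left-to-right order
        if ch == 0:                      # end flag seen
            break
        out.append(ch)
    return out.decode("ascii")
```

Calling decnum2(0x12345678) reproduces the expected answer quoted above. Note that the loop that empties the stack needs no counter and no saved pointer, only the test for the flag value.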

As perhaps this application shows, stacks are extraordinarily versatile data struc- 
tures that find numerous applications in computer science, such as parsing and evaluat- 
ing arithmetic expressions which contain parentheses. The “reverse Polish notation” 
operation of certain small calculators is another familiar example, where the several 
encountered values of an arithmetic expression are temporarily held in a stack until the 
operators come along later. 

In programming at the assembly language level, stacks hold saved copies of data 
from registers while there are more pressing current needs for those registers. When 
procedures are called, the stack is used to preserve the caller’s context, to furnish the 
called routine’s requirements for temporary storage, and perhaps to pass arguments 
when there are too many to be held in registers. 


Jump and Subroutine Instructions


Standard programming practice promotes the use of modular subroutines, procedures,
and functions for several reasons:


• Ease of development: Large programming projects can be logically divided into well-
  defined portions more commensurate with the capabilities of individual programmers.

• Reusability: Well-documented routines for scientific or statistical calculations
  have been produced, maintained, and marketed for decades.

• Reliability: Advances in the theoretical analysis of algorithms have shown that an
  arbitrarily large program may still contain one or more still-undiscovered “bugs”
  but that sufficiently small routines can be empirically tested or mathematically
  proved to give correct results.

• Maintainability: The appearance of repetitive sections of instructions in a large
  program motivates people to compartmentalize one standardized version of the
  sequence of instructions as a subroutine that can be invoked from several different
  places in the program. Then any necessary revisions in the future can be precisely
  targeted, with a lowered risk of introducing errors in some copies.

• Reducing memory requirements: If several copies of a block of instructions can be
  replaced by one carefully documented instance, there can be substantial reduc-
  tions in overall program size.


Of these motivations, the first four now stand well above the fifth in precedence. The 
cost of memory storage has receded in importance over time, while the other concerns 
remain in full force with overriding economic impact because of labor costs and issues 
of liability for program failures. 

Minimally, a subroutine linkage requires some means for “remembering” the 
return address or resumption point, which is the updated value that would have been in 
the program counter (PC) if the subroutine call instruction had been some ordinary non- 
branching instruction instead: 


Calling Routine                         Called Routine

Instruction sequence
Jump to subroutine X                    X: Entry point
                                           Instruction sequence
                                           Return from subroutine
Resumption point


Many architectures use a system-mandated stack to hold the return address, which is 
ordinarily pushed by a jump to subroutine instruction and popped back into the PC as 
part of the action of a return from subroutine instruction. A few computer architectures 
have a small stack implemented within the central processor as a specialized bank of 
registers to be used exclusively for holding return addresses. 

For the more common circumstance that the stack is implemented as a block of
memory, the jump to subroutine and return from subroutine instructions would seem to
involve implied load and store instructions. The design of a RISC architecture requires
careful attention to the complexity of instructions. The subroutine control instructions
warrant special scrutiny because they have to preserve a former value of the program
counter (PC) while establishing another new value in the PC. For this reason, the
designers of the Alpha architecture eliminated any implicit involvement of a stack from
these instructions per se, although in practice complete calling sequences would also
include explicit load and store instructions in order to preserve the calling routine’s con-
text data.

The Alpha instruction set includes five similar instructions (Table 7.1) that are pri- 
marily intended for unavoidable distant jumps and for calling and returning from sub- 
routines, procedures, and functions. We will discuss these instructions in the next 
several sections of this chapter. 


Table 7.1 Alpha Jump and Subroutine Instructions


Mnemonic         Opcode    Bits <15:14>      Purpose
bsr              34        not applicable    Branch to subroutine
jmp              1A        00                Jump (unconditional)
jsr              1A        01                Jump to subroutine
ret              1A        10                Return from subroutine
jsr_coroutine    1A        11                Jump to coroutine





Branch to Subroutine Instruction 


The branch to subroutine (bsr) instruction (opcode 34) uses the same binary format as 
all of the other branch instructions: 


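 31      26 25    21 20                                  0
+----------+--------+-------------------------------------+
|  Opcode  |   Ra   |         Branch displacement         |
+----------+--------+-------------------------------------+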


Recall that the conditional branch instructions test the value in register Ra and then 
either “fall through” or else “take the branch” by modifying the already-updated value 
in the PC through addition of a signed displacement derived from the bit field <20:0> 
shifted left two positions. In contrast, the unconditional branch instruction (br) 
replaces the value in register Ra with the already-updated value in the PC and then 
always “takes the branch” as just described. 

The bsr instruction does the same thing as the br instruction. The only differ-
ence is that a particular Alpha implementation may be able to make useful assumptions
aimed toward performance optimization by noting the different opcode. For us, the dis-
tinction is semantic and mnemonic. With bsr, we expect to see an eventual return to
the instruction in the longword in the code stream that follows the bsr instruction, and
register Ra will hold that return address. With br, we do not expect to see any return,
and Ra is almost invariably the nonmodifiable register R31.
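
The target calculation shared by the branch format instructions can be written out explicitly. The Python sketch below is an illustration only: the helper names are our own, and encode_branch is a hypothetical convenience for building a test instruction from the three fields of the branch format.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def encode_branch(opcode, ra, disp):
    """Hypothetical helper: pack opcode <31:26>, Ra <25:21>, displacement <20:0>."""
    return (opcode << 26) | (ra << 21) | (disp & 0x1FFFFF)

def branch_target(pc, instruction):
    """Address reached when the branch at address pc is taken."""
    updated_pc = pc + 4                           # PC already points past the branch
    disp = sign_extend(instruction & 0x1FFFFF, 21)
    return (updated_pc + 4 * disp) & MASK64       # displacement counts longwords

BSR = 0x34                                        # opcode from Table 7.1
```

A displacement of 0 reaches the next instruction, and a displacement of -1 branches back to the branch instruction itself. The 21-bit signed field gives the range of about one million instructions in either direction.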


NUMOUT: A Local Subroutine 


We now illustrate bare-bones subroutine linkage using bsr and ret instructions by
separating the DECNUM2 program into a subroutine portion called NUMOUT (Figure
7.3) and a main program portion called TESTNUM (Figure 7.4). The binary representa-
tion of 0.8 (DOT8) has been placed in the data section shown with NUMOUT, while the
rest of the items in the data section in DECNUM2 have been placed in the data section
shown with TESTNUM. A calling program does not need to know about internal algo-
rithmic details like DOT8 that a subroutine may require.





/* NUMOUT - ASCII numeric output (Unix) */

/* On entry:
        R0 -> Output area (no bounds checks)
        R1 =  Number to convert
        R9 -> Stack to be used
        GP to provide access to data section

   On exit:
        ASCII output @R0 (no terminator)
        R1 -> End of string + 1 */

        .data
        .align  3                       # Ensure quadword alignment
DOT8:   .quad   0xcccccccccccccccd      # 0.8 (finite approx.)
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
numout:
        lda     $9,-4*8($9)             # Reserve 4 quadwords
        stq     $2,1*8($9)              # Save R2
        stq     $3,2*8($9)              # Save R3
        stq     $4,3*8($9)              # Save R4
        stq     $31,0*8($9)             # Set 0 as end flag
        ldq     $2,DOT8                 # Get reciprocal factor
divl:   umulh   $1,$2,$3                # R3 = high 64 bits
        srl     $3,3,$3                 # R3 = quotient = R1/10
        mulq    $3,10,$4                # R4 = 10*(R1/10)
        subq    $1,$4,$4                # R4 = remainder now
        bis     $4,0x30,$4              # Make into ASCII char
        lda     $9,-8($9)               # Reserve a quadword
        stq     $4,0($9)                #   and store character
        beq     $3,store                # Done if 0 quotient
        mov     $3,$1                   # Move quotient to R1
        br      $31,divl                #   and go divide again
store:  lda     $1,-1($0)               # R1 = pre-indexed pointer
20:     lda     $1,1($1)                # R1 -> where to store char
        ldq     $2,0($9)                # R2 = character
        beq     $2,out                  # Done if end flag seen
        lda     $9,8($9)                #   else pop from stack
        ldq_u   $3,($1)                 # Access quadword,
        insbl   $2,$1,$2                #   insert new byte, and
        mskbl   $3,$1,$4                #   retain other seven
        bis     $2,$4,$3                # Merge and
        stq_u   $3,($1)                 #   store this result
        br      $31,20b                 # See if more digits
out:    ldq     $4,3*8($9)              # Restore R4
        ldq     $3,2*8($9)              # Restore R3
        ldq     $2,1*8($9)              # Restore R2
        lda     $9,4*8($9)              # Release 4 quadwords
        ret     $31,($26)               # Return


Figure 7.3 NUMOUT: An illustration of a simple subroutine


Prerequisite to any successful program segmentation scheme are specifications of 
inputs, outputs, and side effects that are well thought out and clearly defined. Thus 
comments at the top of NUMOUT explain what it expects on input and what its cumu- 
lative effects have been by the point that it exits back to its caller. 

On entry, NUMOUT expects register R1 to contain an unsigned number, register
R0 to contain the address of a memory area where the converted ASCII representation
of a number is to be deposited, and register R9 to point to an adequate stack region. In
addition, NUMOUT expects to be able to access its data region using the global pointer
(Unix) or using register R15 for base addressing (OpenVMS) in order to find DOT8.

On exit, NUMOUT has produced an unterminated string of ASCII characters repre-
senting the decimal number. Register R0 has been unaltered, and thus still points to the
first character of that string. Register R1 has been changed, and now points to the byte just
beyond the last character of the output string. This means that the calling routine can
determine the length of the generated string by computing the difference, R1 - R0.

The central portion of NUMOUT (Figure 7.3) corresponds quite closely to the
central portion of DECNUM2 (Figure 7.2). We need to point out several precautions,
however. First, we have used an assembler directive (.align) to make sure that the
DOT8 constant is quadword aligned, since it will be loaded using an instruction (ldq)
that requires aligned data. The assembler also needs a base register for the ldq instruc-
tion, and we have included for this reason a comment as a reminder that the global
pointer (Unix) or pointer to the data segment (OpenVMS) from the calling program is
important. The stack managed by register R9 is used both for the same intrinsic algo-
rithmic purpose as in DECNUM2 and for the additional purpose of saving the values
originally in registers R2 through R4. (We are pretending that the caller would be
retaining valuable information in those three registers, even though that is not actually
true for TESTNUM.)

Working with subroutines necessitates utmost respect for stack levels under all 
conditions of program branching. In this instance, the beq exit test from the store loop 
must occur one instruction earlier in NUMOUT (4 instructions beyond the label 
store) than it did in DECNUM2 (5 instructions beyond the label store), if the exit 
coding at out is to proceed with the same stack level as the entrance coding at the 
opening of the subroutine. Note that the comment fields of lines in the vicinity of this 
beq instruction have been reworded to make good sense again. Comments must co- 
evolve when changes are made in the instructions themselves. 

NUMOUT illustrates both of the methods for allocating quadwords of stack stor- 
age that we mentioned earlier in this chapter. Four quadwords are reserved with a single 
lda instruction (for preserving registers R2 through R4 and for holding the end flag) 
and similarly released with a single lda instruction. In the divide loop, one quadword
is taken as needed for every generated digit; these quadwords are released one by one in 
the store loop. 

The routine ends with a subroutine return (ret) instruction. There is no .end 
directive because this source file will not be assembled alone (see the next paragraph). 
We specify register R31 as the register to be stored into (except that R31 is always 
zero!), in order not to lose the contents in any other register. The caller of NUMOUT 
has to have put the address for its own resumption point into register R26. Obviously 
NUMOUT itself must not modify that register value, which is needed in order to “find 
the way back” to the caller. 

How can we test NUMOUT to ensure that its behavior is correct? We have taken
most of the remaining portions of the DECNUM2 program and have added a segment
to call the C language output function. Those features comprise the short main program
TESTNUM (Figure 7.4). The assembler directive #include (Unix) or .include
(OpenVMS) is used to insert the entire text file containing the NUMOUT subroutine
just above the .end directive of TESTNUM. Overall, then, the assembler will process
a source stream that includes both the caller and the called.


/* TESTNUM      ASCII numeric output (Unix) */

/* This program tests the NUMOUT subroutine */

NL      =       '\n'                    # Newline character
LEN     =       20                      # Allowance for 20 digits
STACK   =       2                       # Quadwords needed
FRAME   =       ((STACK*8+8)/16)*16
        .data
        .comm   BLOCK,8*LEN             # Space for user-defined stack
N3:     .quad   0xfedcba9               # Number to convert
        .comm   A3,80                   # Space for result
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #   mark the mandatory
main:                                   #   'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        .mask   0x04000000,-FRAME       # Saved only register R26
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
new:    lda     $9,BLOCK+(8*LEN)        # R9 -> user stack
        lda     $0,A3                   # R0 = storage pointer
        ldq     $1,N3                   # Get number to convert
        bsr     $26,numout              # Note that NUMOUT changes
                                        #   R1 -> next free byte
null:   mov     (NL),$12                # R12 = newline code
        ldq_u   $11,($1)                # Access quadword,
        insbl   $12,$1,$12              #   insert new byte, and
        mskbl   $11,$1,$11              #   retain other seven
        bis     $12,$11,$11             # Merge and
        stq_u   $11,($1)                #   store this result
        subq    $0,1,$11                # R11 -> string (pre-indexed)
        mov     $1,$12                  # R12 -> end (the newline)
oloop:  lda     $11,1($11)              # Advance the pointer
        cmpule  $12,$11,$1              # If beyond the newline,
        blbs    $1,done                 #   then all done
        ldq_u   $16,($11)               # Load quadword location
        extbl   $16,$11,$16             # Extract desired byte
        jsr     $26,chrput              #   to send
        ldgp    $gp,0($26)              # Restore global pointer
        br      $31,oloop               # See whether done yet
done:   mov     0,$0                    # Signal all is normal
        ldq     $26,($sp)               # Restore exit address
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to Unix environment
#include "numout.s"                     # Insert subroutine code
        .end    main                    # Mark end of procedure


Figure 7.4 TESTNUM: An illustration of local subroutine linkage


As before, we are displaying only one variant of NUMOUT and TESTNUM (for
Unix). Suitable system command lines would be as follows:


        > cc -g -O0 -o testnum testnum.s getput.o       (Unix)
        > testnum


        $ macro/alpha/list testnum                      (OpenVMS)
        $ link/map testnum,getput
        $ run testnum


The result displayed should be 267242409.

If we had a larger actual main program that generated decimal numbers for print- 
ing at several places through the course of its body, we would still need only the one 
copy of NUMOUT. Prior to each bsr call, we would have to ensure that the global 
pointer (Unix) or a base register for the data section (OpenVMS) was valid and that 
those or some other register would similarly serve as a base register when the assembler 
composes the displacement to symbolic label NUMOUT for each bsr instruction. 


The Jump Group of Instructions 


We worked out in Chapter 5 that the branch format instructions can span a forward or 
backward addressing range of one million instructions from the current (updated) PC 
value. The designers of the Alpha instruction set have also provided for a longer range 
movement with the unconditional jump instruction (jmp) and three related subroutine 
call/return instructions. These four instructions (Table 7.1) are encoded as a subclass of 
the memory class instructions: 


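 31      26 25    21 20    16 15 14 13                   0
+----------+--------+--------+-----+----------------------+
|  Opcode  |   Ra   |   Rb   | Code|         Hint         |
+----------+--------+--------+-----+----------------------+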


These four instructions share one opcode (1A), but are distinguished for the hardware 
implementations that must execute them by an assigned code in bits <15:14> in the 
instruction longword. 

All four instructions in the jump subclass perform identical operations. A new tar-
get address is derived from the value in register Rb by masking the two lowest bits to
zero. The updated value in the program counter, i.e., the PC value at the instruction fol-
lowing the jump instruction, is copied into register Ra and then the target virtual address
is copied into the PC. If Ra and Rb specify the same register, the target calculation
using the old value is accomplished before the new value is assigned from the PC. That
is, if Ra and Rb are the same, the effect of the jump instruction is to interchange the
contents of the PC and that register.
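
The order of operations just described matters when Ra and Rb name the same register. The following Python sketch models it; the function name and the representation of the register file as a list of 32 integers are our own assumptions for illustration.

```python
def jump(regs, pc, ra, rb):
    """Model a jump-group instruction located at address pc; return the new PC."""
    target = regs[rb] & ~0x3      # target is computed from the OLD value of Rb
    if ra != 31:                  # R31 always reads as zero; writes are discarded
        regs[ra] = pc + 4         # updated PC is copied into Ra
    return target

regs = [0] * 32
regs[27] = 0x2003                 # low two bits will be masked away
pc_after_call = jump(regs, 0x1000, 26, 27)     # like jsr R26,(R27)

regs[26] = 0x3000
pc_after_swap = jump(regs, 0x1000, 26, 26)     # Ra and Rb are the same register
```

In the second call the PC receives the old contents of R26 while R26 receives the updated PC, which is the interchange described above.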

The jmp instruction is intended as a long-range unconditional branch instruction. 
The br instruction can only move by one million instructions in either direction. The 
jmp instruction can move anywhere within the whole 64-bit address space because the 
destination address is held in a register. The br instruction is self-contained, and the 
assembler can calculate the displacement built into the instruction without reference to 
any base register. The jmp instruction requires some prior setup, usually an lda
instruction that itself requires some base register, when the destination address is pre- 
loaded into a register. In a sense, then, the use of jmp may involve two instructions 
where br would involve only one. We remarked in Chapter 5 that the range of Alpha 
branch instructions greatly improves upon the small PDP-11 and VAX branch ranges. 
All three architectures have jump instructions with no range limitations. 

Bits <13:0> in the jump group instructions are reserved for use by compilers, 
which may insert “hints” such as the low 16 bits of a likely branch target address for the 
jmp and jsr instructions (the two lowest bits of an instruction address are always zero,
and do not need to be physically represented). Some implementations may have the 
capability of an anticipatory prefetch by combining these bits with other cached run- 
time information. 


Jump Tables 


Many programming languages have offered one or another method for implementing a 
multi-way branch. In classic FORTRAN, this was a so-called “computed GOTO” 
instruction in which a set of possible values of a testable integer quantity would be asso- 
ciated with an equivalent set of destination addresses. In classic BASIC, a “computed 
GOSUB” was also offered. Not only newer languages like Pascal, but also the contem-
porary standard FORTRAN and Digital BASIC, now offer semantics for handling
“cases” based on arbitrarily testable conditions instead of a set of contiguous integer 
values. In Digital BASIC, this is a SELECT block: 


        SELECT <selector variable or expression>
            CASE -3
                <what to do when selector = -3>
            CASE +5
                <what to do when selector = +5>
            CASE 0
                <what to do when selector = 0>
            CASE ELSE
                <what to do otherwise>
        END SELECT


All of the “what to do” segments may be instruction blocks of arbitrary complexity.
They may also be of null length (in order to help define “ELSE” by a process of elimi-
nation). The chief advantage of such case structures is the way they make obvious how
to add further cases later, when a program is being maintained over its life cycle.

We leave it as an exercise to design the code for such arbitrary cases, and instead 
illustrate the simpler situation where the cases are distinguished by sequential integers, 
starting from zero. 

As we consider how such high-level constructs might be implemented or sup- 
ported at the assembly language and machine language level, we may separate the 
phase of testing the selector variable from the implicit branching phase. Through a cas- 
cade of case-by-case tests, or sometimes by a clever condensed algorithm, a single 
quantity can be derived that takes on one of the values within a set which has only as 
many different values as there are different cases. 

Suppose that the first case is called case 0, the second is called case 1, and so 
forth. We could then use the s8addq scaled addition instruction to turn this case num- 
ber into a quadword displacement value and add it to a base value. The result, which of 
necessity must be in a register on an Alpha, would be just right for implementing cases 
using a jump table. 


        $data_section
JMPTBL: .address CASE0
        .address CASE1
        .address CASE2
        etc.

        $code_section
CASE:   <testing logic that puts an index code into Rx>
JUMP:   .base   ..              ; Assure addressability of JMPTBL
        lda     Rb,JMPTBL       ; Rb -> JMPTBL
        s8addq  Rx,Rb,Rx        ; Rx := Rb + 8*Rx
        ldq     Rx,(Rx)         ; Get indirect address
        jmp     R26,(Rx)        ; Dispatch
JOIN:
        ; rest of the mainline
        ; need a BR or JMP here to skip beyond cases

CASE0:  <what to do for case 0>
        jmp     R31,(R26)       ; Resume at JOIN
CASE1:  <what to do for case 1>
        jmp     R31,(R26)       ; Resume at JOIN
CASE2:  <what to do for case 2>
        jmp     R31,(R26)       ; Resume at JOIN
        etc.


The jump table itself is an ordered list of addresses at symbolic location JMPTBL, pro-
duced using the .address directive (OpenVMS). Appropriate logical tests would
flow to the instruction sequence labeled JUMP, where an integer code (0, 1, 2, etc.)
would be converted into quadword-spaced displacements (0, 8, 16, etc.) added to the
starting address of the jump table. The ldq instruction picks out the appropriate target
address. The jmp instruction stores in register R26 the address corresponding to the
symbol JOIN.

Each of the code segments that perform the intended processing for the various 
cases would end with a return jump using the preserved value in register R26. In actual 
practice, the “cases” might be put into a different code section, which the OpenVMS 
linker would then “move out of the way” so that the main program flow could proceed 
from the sequence at JOIN without need for any branch-beyond instruction. 

Again here at the assembly language level, we have gained an advantage in read- 
ability and maintainability through the jump table construct. Adding a new case would 
only require three coordinated modifications: adding one more address to the jump 
table, elaborating the testing logic to recognize the new case and assign its jump code, 
and writing the corresponding new “what to do” program segment. 

We can think of cases and jump tables as an exercise in set associations. A set of 
actual, perhaps very arbitrary, cases is associated with a set of integers. Those integers 
are associated with a set of corresponding addresses. Those addresses are associated 
with a set of corresponding “what to do” program segments. At an overview level, there 
is an expression of the case structure from the label CASE to the label JOIN, lending 
itself to a focus on the critical testing logic. The details for each case are handled else- 
where (at the labels CASEn). The properties of the jump instructions and the jump table 
tie things together. 
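
The same chain of associations can be expressed compactly in a high-level language. The Python sketch below is only an analogy under our own assumptions: a dictionary keyed by hypothetical addresses stands in for memory, Python functions stand in for the CASEn code segments, and the bounds check is an addition that the assembly outline above does not contain.

```python
MASK64 = (1 << 64) - 1
JMPTBL = 0x2000                      # hypothetical address of the jump table

def s8addq(ra, rb):
    """Scaled add, as in s8addq Ra,Rb,Rc: result = 8*Ra + Rb."""
    return (8 * ra + rb) & MASK64

def case0(): return "case 0"
def case1(): return "case 1"
def case2(): return "case 2"

# One quadword-spaced entry per case, as the .address directives would lay out.
memory = {s8addq(i, JMPTBL): handler
          for i, handler in enumerate((case0, case1, case2))}

def dispatch(index):
    """Select and run the handler for an index code of 0, 1, 2, ..."""
    if not 0 <= index < len(memory):     # added safeguard, not in the outline
        return "otherwise"
    return memory[s8addq(index, JMPTBL)]()
```

Adding a case then means adding one handler and one table entry, which mirrors the coordinated modifications listed above.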


Subroutine Calls 


Building a large program from smaller units that can be debugged separately can bring 
about valuable improvements in reliability, readability, maintainability, and program- 
mer productivity. Moreover, the division of programs into subroutines has historically 
served to conserve memory taken up by the program. Replacing every occurrence of a 
repetitive sequence of instructions with a jump to subroutine instruction, and then add- 
ing a return instruction at the end of a unique instance of that sequence of instructions 
can save memory. 

The countervailing factor has always been concern for execution speed. With a
RISC architecture, considerations of run-time performance may argue for repeating
instruction sequences wherever needed, especially those that are only of modest length,
instead of encapsulating them as a subroutine. Nevertheless, the repeated instances can
be maintained in common at the source level either through the use of the include
capability of an assembler or through the use of the macro capability offered by some
assemblers, which we take up in a later chapter. On the other hand, copies of code seg-
ments at different addresses cannot be recognized in a content-sensitive fashion by an
ordinary instruction cache system, whereas a small repetitively invoked routine might
stay in the cache and thus contribute to good overall performance.

The Alpha architecture provides the jsr/ret instruction pair for the purpose of 
supporting traditional subroutine calls. Just as the jmp instruction supplements the br 
instruction in providing a wider addressing range, so too the jsr instruction makes 
possible subroutine calls across the full 64-bit address space. The attendant cost, how- 
ever, also involves somehow getting the target address loaded into the register called Rb 
in the discussion of the jmp instruction that we gave earlier: 


Calling Routine                         Called Routine

Instruction sequence
load address X into Rb
jsr Ra,(Rb)
                                        X: Entry point
                                           Instruction sequence
                                           ret R31,(Ra)
Resumption point


Additional overhead will be involved if the subroutine must support recursion, since the 
linkage register Ra must then be preserved and restored on a stack. 

We elect not to present a specific programming example of hand-coded usage of 
the jsr and ret instructions at this point. Such an illustration would be very similar to 
NUMOUT and TESTNUM (Figures 7.3 and 7.4). Instead, we will soon lift the veil of 
mystery from the system-supplied macros $call and $return (OpenVMS), which 
themselves produce jsr and ret instructions but also afford us certain conveniences 
regarding the treatment of entry point address, the preservation of registers, and support 
for recursion. We then return to jsr and ret and the calling conventions for the Unix 
programming environment, where such macros are not provided. 


Coroutines 


The Alpha architecture also includes the jsr_coroutine instruction. With an ordi- 
nary hierarchical structure of a main program and a subroutine, the subroutine always 
begins and ends for every jump to subroutine instruction, while the main routine 
resumes after each return from subroutine instruction. The relationship between the 
main program and the subroutine is thus asymmetric. 

Suppose instead that two routines could be designed in such a manner that they 
would call each other in a symmetrical manner, as coroutines. Whenever the one leaves 
off, the other resumes: 


Coroutine A                              Coroutine B

lda Rb with address X
jsr_coroutine R26, (Rb)
                                     X:  Entry point
                                         Instruction sequence
                                         jsr_coroutine R26, (R26)
Resumption point
Instruction sequence
jsr_coroutine R26, (R26)
                                         Resumption point
                                         Instruction sequence
                                         jsr_coroutine R26, (R26)
Resumption point
etc.                                     etc.


Coroutines can continue indefinitely to call each other in alternation. Register R26 suc- 
cessively stores a new updated PC return address with each jsr_coroutine instruc- 
tion encountered. Note that there is not necessarily any stack activity, and that 
coroutines A and B communally share whatever information is contained in all the reg- 
isters. The overall task is divided between the two routines. 
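
The alternation can be traced with a small Python model of a toy machine. It is a sketch under stated assumptions, not Alpha code: the program dictionary, the operation names, and the addresses are hypothetical, the operation "co" stands for jsr_coroutine R26, (R26), and R26 is preloaded with the entry address X to stand in for the separate register Rb used by the very first call.

```python
def run(program, start, first_target, max_steps=20):
    """Execute a toy program and return the order in which work was done."""
    pc, r26, trace = start, first_target, []
    for _ in range(max_steps):
        op = program.get(pc)
        if op is None or op == "halt":
            break
        next_pc = pc + 1
        if op == "co":
            # Read the target from R26 first, then store the updated PC in R26.
            pc, r26 = r26, next_pc
        else:
            trace.append(op)            # an ordinary "instruction sequence"
            pc = next_pc
    return trace


program = {
    0: "A1", 1: "co",                   # coroutine A
    2: "A2", 3: "co",
    4: "A3", 5: "halt",
    100: "B1", 101: "co",               # coroutine B, entry point X = 100
    102: "B2", 103: "co",
}

print(run(program, start=0, first_target=100))
```

The trace alternates between the two routines, and no stack is involved: the single register R26 always holds the point at which the other routine left off.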

The interrupt enable routine and the interrupt handling routine in the RT-11 oper- 
ating system for PDP-11 computers interacted using a scheme resembling coroutines. 
Other applications of coroutines have been suggested by various authors, including 
Schechter and Stallings. Coroutines might be designed in such a way as to avoid mak- 
ing multiple passes over a large data file, yet retain the conceptual advantage of com- 
partmentalized program code for each type of specialized processing. A pair of 
coroutines might also be envisioned as players for Black and for White in a program 
that demonstrates the game of chess. 


Conventions for Register Use 


Efficient development of software requires considerable standardization, not only to 
minimize confusion and maximize understandability, but also to lessen duplicative 
machine activity when saving and restoring the contents of registers. Considerable 
potential for inefficiency arises when libraries of subroutines are to be called upon. We 
have already seen how the NUMOUT subroutine was designed conservatively to save 
three registers that it would use, but that the TESTNUM calling program did not actually 
require preservation of the contents of those registers for its own correct functioning 
upon return from NUMOUT. Conversely, if a caller such as TESTNUM were to save copies of 
the contents from all the registers it had in active use, there would be no assurance that 
a called routine such as NUMOUT would need to use all such registers. This conundrum can 
be partially mitigated by conventions for register use, in combination with standardized 
call/return sequences. 

Certain conventions for usage of the 64 general purpose registers of the Alpha 
architecture have been established for the various programming environments (Table 7.2). 


Table 7.2  Alpha Register Usage Conventions

Register  Unix         NT          VMS         Usage Convention                      Saved

0         $0           R0          R0          Integer function result               No
1         $1           R1          R1          Conventional scratch register         No
2-8       $2-$8        R2-R8                   Conventional scratch registers        No
                                   R2-R8       General uses                          Yes
9-14      $9-$14       R9-R14      R9-R14      General uses                          Yes
15        $15 or $fp   R15 or FP               Frame pointer, else saved             Yes
                                   R15         General use                           Yes
16-21     $16-$21      R16-R21                 Integer arguments by value            No
                                   R16-R21     Integer arguments, variously          No
22-24     $22-$24      R22-R24     R22-R24     Conventional scratch registers        No
25        $25          R25                     Conventional scratch register         No
                                   R25         Argument information (AI)             No
26        $26          R26 or RA   R26         Return address register               No
27        $27                                  Procedure value (pointer)             No
                       R27                     Conventional scratch register         No
                                   R27         Procedure value (PV)                  No
28        $28 or $at   R28         R28         Reserved for use by system software   No
29        $29 or $gp   R29 or GP               Global pointer                        No
                                   R29 or FP   Stack frame pointer                   Yes
30        $30 or $sp   R30 or SP   R30 or SP   Stack pointer                         Yes
31        $31          R31         R31         Zero (not modifiable)                 n/a

0         $f0          F0          F0          Floating point function result        No
1         $f1          F1          F1          Imaginary part function result        No
2-9       $f2-$f9      F2-F9       F2-F9       General uses                          Yes
10-15     $f10-$f15    F10-F15     F10-F15     Conventional scratch registers        No
16-21     $f16-$f21    F16-F21     F16-F21     Floating point arguments by value     No
22-30     $f22-$f30    F22-F30     F22-F30     Conventional scratch registers        No
31        $f31         F31         F31         Zero (not modifiable)                 n/a


At the level of the hardware, only registers R31 and F31 are really distinctive. 
These particular registers always yield zero when referenced semantically as a source of 
data. Although the digital logic circuits comprising registers R31 and F31 make them 
“read only” elements, the assembler nevertheless permits these registers to be used with 
destination semantics in instructions: they then appear to act as sink-holes or “bit buck- 
ets” for unwanted data. 
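
The behavior of the zero register can be captured in a few lines of Python. This is a minimal sketch of a 32-entry integer register file; the class name RegisterFile and its methods are hypothetical and serve only to illustrate the semantics just described.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


class RegisterFile:
    """Thirty-two 64-bit integer registers; R31 always reads as zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, number):
        return 0 if number == 31 else self._regs[number]

    def write(self, number, value):
        if number != 31:                # a write to R31 is silently discarded
            self._regs[number] = value & MASK64


rf = RegisterFile()
rf.write(2, 7)
rf.write(31, 12345)                     # goes into the "bit bucket"
print(rf.read(2), rf.read(31))
```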


Some registers must always be saved and restored if they are used by a called pro- 
cedure. Several of the higher-numbered integer registers are allocated for contextual 
information, e.g., the principal stack pointer. Certain other integer registers and float- 
ing-point registers have standardized uses for communicating arguments or computed 
function values between calling and called procedures. The remaining registers may be 
freely used by any called routine for scratch purposes, but with the clear warning that 
information in them is volatile and may be overwritten if the routine calls one at a 
deeper level. Every level of program segmentation must abide by all of the conventions. 
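
As an illustration of the first rule, the following Python sketch works out which registers a called procedure would have to save in its prologue. It assumes only the registers that are marked as saved in all three environments (R9 through R15 and F2 through F9); the names CALLEE_SAVED and registers_to_save are hypothetical.

```python
# Registers assumed to be saved by the called procedure in every environment.
CALLEE_SAVED = {f"R{n}" for n in range(9, 16)} | {f"F{n}" for n in range(2, 10)}


def registers_to_save(used_by_callee):
    """Return the registers a called procedure must save and restore."""
    needed = set(used_by_callee) & CALLEE_SAVED
    # Sort by register bank (F before R), then by register number.
    return sorted(needed, key=lambda name: (name[0], int(name[1:])))


print(registers_to_save(["R22", "R15", "F3", "R9", "R0"]))
```

Registers such as R0 and R22 do not appear in the result: they are scratch registers, so the called procedure may overwrite them freely, and a caller that still needs their contents must preserve them itself.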


At an even higher level, the operating system environment respects not only the 
contextual information but also all of the temporary storage of a user process. That is, 
except for register R28, information even in the registers marked for scratch use in 
Table 7.2 will not be overwritten when the operating system swaps user processes on a 
timesharing system or responds to hardware interrupts on behalf of the current process, 
a dormant process, or the operating system itself. This distinction is important: proce- 
dure calls occur synchronously at “convenient” times and must preserve all agreed- 
upon things; system routines that respond to asynchronous events have no way to evalu- 
ate “importance” of information and thus are obliged to respect and save everything, 
every time. 


Passing Arguments to Procedures 


For the rest of this chapter we will turn to the topic of procedures, which are more gen- 
eral and versatile than subroutines. Procedures, like subroutines, help to segment large 
programs into well-conceived, potentially reusable modules that can be developed, 
tested, and maintained independently. In most programming environments, procedures 
have more stringent and more formally specified inputs, actions, outputs, and conven- 
tions for calls and returns. Although some overhead is associated with procedure calls, 
the software engineers who design programming environments aim to maximize func- 
tionality and minimize needless overhead. 

The inputs and outputs of procedures are most often called arguments. In a high- 
level language, the arguments may correspond to an ordered list of expressions 
enclosed in an outer set of parentheses: 


do_something( expr1, expr2, expr3, error ) 


Procedures are often designed for fixed numbers and types of arguments, but sometimes 
a procedure may need to be informed about the total number of arguments, e.g., the 
minimum value function in FORTRAN: 


MIN( expr1, expr2, ... exprN ) 


The manner in which the number of arguments or the data type of each argument is 
made known to the called procedure will vary with the programming environment. 

In any programming environment for the Alpha, the first argument will be passed 
in register R16 or register F16, the second argument in R17 or F17, ... and the sixth 
argument in R21 or F21. If there are more than six arguments, the others are put onto 
the SP stack in a manner which we describe a little later on. 
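
The rule can be summarized by a short Python sketch that maps an argument list to the places where the arguments travel. The function name argument_locations and the notation stack[k] are hypothetical; the actual layout of the stack arguments is not modeled here.

```python
def argument_locations(arg_kinds):
    """Map each argument ('int' or 'float') to the place where it is passed."""
    locations = []
    for position, kind in enumerate(arg_kinds):
        if position < 6:
            bank = "F" if kind == "float" else "R"
            locations.append(f"{bank}{16 + position}")     # R16-R21 or F16-F21
        else:
            locations.append(f"stack[{position - 6}]")     # seventh and later
    return locations


print(argument_locations(["int", "float", "int"]))
print(argument_locations(["int"] * 8))
```

Note that the register number depends only on the position of the argument: a floating-point second argument goes into F17, and R17 is simply left unused for that call.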


Argument Information Register (OpenVMS) 


In the OpenVMS programming environment for Alpha systems, a calling procedure 
informs a called procedure about arguments through an argument information (AI) reg- 
ister; this capability is assigned to register R25. 

The argument information register is used not as a single quadword but as a com- 
posite of fields and subfields: 


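 63                26 25                          8 7               0
+--------------------+-----------------------------+-----------------+
|      reserved      | argument type codes (6 x 3) | argument count  |
+--------------------+-----------------------------+-----------------+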


The lowest byte AI<7:0> specifies the total number of 64-bit argument items being 
passed. The first six items are always passed in the general registers, and all additional 
arguments are passed on the stack. For each of those six items, AI<25:8> allocates 3 
bits as a code. For example, bits <10:8> describe the first argument. A code of 0 indi- 
cates either no argument or an integer argument passed in register R16, while a code of 
4 or 5 indicates a floating-point argument passed in register F16 having S_floating or 
T_floating data type, respectively. Codes 1 through 3 indicate VAX-compatible float- 
ing-point data types that are not treated in this book, while codes 6 and 7 are reserved to 
the vendor. 
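
To make the field layout concrete, here is a Python sketch of the encoding just described. The function name encode_ai and the constant names are hypothetical, and the sketch assumes that bits above bit 25 are simply left at zero.

```python
CODE_INTEGER = 0        # integer argument (or no argument)
CODE_S_FLOATING = 4     # S_floating argument in a floating-point register
CODE_T_FLOATING = 5     # T_floating argument in a floating-point register


def encode_ai(arg_codes):
    """Build an AI register value from one type code per argument."""
    count = len(arg_codes)
    if count > 255:
        raise ValueError("the argument count must fit in AI<7:0>")
    ai = count                                          # AI<7:0>
    for position, code in enumerate(arg_codes[:6]):     # register arguments only
        ai |= (code & 0b111) << (8 + 3 * position)      # 3 bits each in AI<25:8>
    return ai


# One integer argument followed by one T_floating argument.
print(hex(encode_ai([CODE_INTEGER, CODE_T_FLOATING])))
```

In this example the count field holds 2, bits <10:8> hold code 0 for the first argument, and bits <13:11> hold code 5 for the second.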


Fortunately, we shall not need to perform the bit-encoding for the AI register contents 
by ourselves. The system-supplied $call macro and the assembler will attend to 
those details for us. We defer discussion of the placement of a seventh or further argu- 
ments on the stack until we take up the $call macro below. 


Argument Passing Methods 


A typical calling standard, which may apply across the full range of programming lan- 
guages, provides that each item in an argument list may belong to one of three classes 
according to its “immediacy” in an analogous sense to our earlier discussion of address- 
ing modes: 


• Immediate value. The quadword item in the register (or stack location) is the 
  actual argument itself. Only a scalar quantity or single array element, i.e., an 
  integer or real floating-point value, can be passed by value. 

• Reference. The quadword item in the register (or stack location) is the address of a 
  data item, which may be a scalar, a string, an array, a record, or a procedure. Other 
  arguments may be associated with an item passed by reference, such as 
  dimensionality information for an array or the length for a string. 

• Descriptor. The quadword item in the register (or stack location) is the address of 
  a descriptor, a standardized data structure that contains not only an actual address 
  of the data but also appropriate information about how the data have been stored 
  and can be accessed. Descriptors contain such information as array bounds and 
  string lengths. 


That is, we say that each individual argument is being passed by the caller to the called 
procedure by value, by reference, or by descriptor. Note the strong parallel to immedi- 
ate, direct, and deferred addressing modes. 
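
The three classes differ in how many steps the called procedure needs before it reaches the data. The following Python sketch models this with a toy memory; the dictionary memory, the addresses, and the simplified descriptor (a dictionary with pointer and length entries) are all hypothetical.

```python
memory = {}                     # toy memory: address -> stored object


def store(address, obj):
    memory[address] = obj
    return address


def fetch_by_value(item):
    return item                                 # the item is the argument itself


def fetch_by_reference(item):
    return memory[item]                         # one memory access


def fetch_by_descriptor(item):
    descriptor = memory[item]                   # first access: the descriptor
    data = memory[descriptor["pointer"]]        # second access: the data
    return data[: descriptor["length"]]


store(0x1000, "HELLO WORLD")
store(0x2000, {"pointer": 0x1000, "length": 5})

print(fetch_by_value(42))
print(fetch_by_reference(0x1000))
print(fetch_by_descriptor(0x2000))
```

In each case the quadword that travels in the register is a single number (42, 0x1000, or 0x2000); only its interpretation by the called procedure changes.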


The default method for passing arguments in the OpenVMS programming environment 
differs for various high-level languages (Table 7.3), but the syntax of each language 
also provides for overriding the default. Details can be found in the appropriate 
user’s manual for the language. 


Table 7.3  High-Level Language Syntax for Passing Arguments (OpenVMS)

Language   By Value             By Reference             By Descriptor

BASIC      argument BY VALUE    argument BY REF          argument BY DESC
                                (numeric default)        (string default)
C          argument             &argument                define a structure, then:
           (usual default)                               &struct_name
FORTRAN    %VAL(argument)       %REF(argument)           %DESCR(argument)
                                (numeric default)        (string default)
Pascal     %IMMED argument      %REF argument            %DESCR numeric
                                (usual default)          %STDESCR string
COBOL      argument BY VALUE    argument BY REFERENCE    argument BY DESCRIPTOR
                                (usual default)


Argument passing in the Unix programming environment follows the default con- 
vention for the C language, i.e., passing by immediate value. Passing by reference can 
also be designed for a particular procedure, if desired, since a full address fits into the 
same space (64 bits) as a quadword integer value. 


String Descriptors (OpenVMS) 


Since strings vary in length, any routine that works with strings needs some informa- 
tion about the number of characters contained in a string. That information may be 
implicit, as when the .asciz directive produces a terminated string by appending 
the special NUL value to the string as a one-byte end marker. The length may instead 
be stored explicitly alongside the string, as when another directive (.ascic) pro- 
duces a counted string in which one unsigned count byte is prefixed to the string that 
is stored in subsequent bytes. The OpenVMS assembler also provides the .ascid 
directive for accompanying a stored string with an associated descriptor. That latter 
directive produces a VAX-compatible descriptor containing only a 32-bit address, 
however. We shall describe instead the newer Alpha-specific descriptor format con- 
taining a full 64-bit address. 
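
The first two storage formats are easy to picture as byte sequences. The Python functions below are a sketch that mimics the bytes produced by the .asciz and .ascic directives; the function names are ours and are not part of any assembler.

```python
def asciz(text):
    """Terminated string: the characters followed by a NUL byte."""
    return text.encode("ascii") + b"\x00"


def ascic(text):
    """Counted string: one unsigned count byte followed by the characters."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("a one-byte count cannot exceed 255")
    return bytes([len(data)]) + data


print(asciz("Hi"))
print(ascic("Hi"))
```

The one-byte count limits a counted string to 255 characters, whereas a terminated string may have any length but cannot contain the NUL value itself.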

The general descriptor format for 64-bit addressing in OpenVMS always consists 
of at least three aligned quadwords (additional quadwords are needed for the descriptors 
of arrays): 


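 63                           32 31       24 23       16 15              0
+-------------------------------+-----------+-----------+-----------------+
|              -1               |   class   |   type    |        1        |
+-------------------------------+-----------+-----------+-----------------+
|                                 length                                  |
+-------------------------------------------------------------------------+
|                            pointer (address)                            |
+-------------------------------------------------------------------------+
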
where type is an unsigned byte code denoting the type of data element being described, 
and where class provides further information about how the data element should be 
treated. 

Strings are assigned as type = 14 (decimal). A static string having a fixed length is 
assigned as class = 1, while a dynamic string is assigned as class = 2. We can build the 
descriptor for a static string as follows: 


STR:    .ascii   "The string to be stored"
$$$ = .
        .align   quad
DESCR:  .word    1
        .byte    14             ; String data
        .byte    1              ; Fixed length
        .long    -1
LEN:    .quad    <$$$ - STR>    ; Length
LOC:    .address STR            ; Pointer


where we allow the assembler to compute the length for us by evaluating an expression 
<$$$ - STR> containing a temporary symbol that retains the location counter value 
of the byte just after the last stored character of the string. The .align directive 
ensures that the descriptor beginning at symbolic location DESCR will be aligned to a 
natural quadword addressing boundary. 
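
The same 24 bytes can be produced with Python's struct module, which gives a convenient check on the layout. This is a sketch: the function name static_string_descriptor and the address 0x10000 are hypothetical, and little-endian byte order is assumed.

```python
import struct


def static_string_descriptor(length, address, dtype=14, dclass=1):
    """Pack a three-quadword string descriptor in little-endian byte order."""
    # word 1, type byte, class byte, longword -1, quadword length, quadword pointer
    return struct.pack("<HBBiQQ", 1, dtype, dclass, -1, length, address)


text = "The string to be stored"
descr = static_string_descriptor(len(text), 0x10000)
print(len(descr), descr[:8].hex())
```

The first quadword packs the constant 1, the type code 14, the class code 1, and the longword -1; the length and the pointer each occupy a full quadword of their own.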

A dynamic string would initially be set up the same way, with class = 2, but dur- 
ing program execution its length (LEN) or its location (LOC) might change. Strings that 
come as input at run time are usually implemented as dynamic strings, as we shall see 
in a later chapter on input and output support. 


Types of Alpha Procedures 


A well-designed programming environment provides clear guidelines for the way that 
procedure calls and returns should take place. The essential context of the caller must be 
preserved. Context includes the subset of registers that must be saved (see Table 7.2), 
the return address, and the caller’s view of the system-wide stack. 

Preservation of context, such as saving register contents, can be done by a calling 
routine or a called routine. Since there would be wasteful overhead if register saving 
were repeated both before the actual jsr instruction (by the caller) and after it (by the 
called routine), one method or the other is selected for a standardized environment. 

For the Alpha, the responsibility for saving the registers designated with “Yes” in 
Table 7.2 is built into the expectations for prologue and epilogue segments associated 
with the called procedure. That is, a called routine that will use any of the registers 
marked “Yes” must save them during its prologue and restore them during its epilogue. 
For the OpenVMS assembler, the system-supplied $routine and $return macros 
facilitate such saving and restoring of register contents as well as appropriate attention 
to certain other details of context preservation. In the Unix environment, the compiler or 
the assembly language programmer must attend to such details. 

Conversely, the responsibility for saving data in any registers designated with 
“No” in Table 7.2 falls to the calling routine, whose designer should clearly know 
whether such data would be needed again after a called procedure returns control to the 
caller. Such divided responsibility between the caller and the called tends to minimize 
needless save/restore operations and to maximize the overall usefulness of the fixed 
number of registers. 

Alpha assembly language programmers and compiler writers have a choice of three 
standardized types of procedures that differ in how the caller’s context is preserved: 


• Stack frame procedures. The caller’s context is pushed onto the stack in a 
  prescribed way when the procedure is called, and later restored from the stack when 
  the procedure returns. A stack frame procedure is colloquially called “heavy 
  weight” because it has the most overhead, but supports the fullest set of features. 

• Register frame procedures. The caller’s context is saved in scratch registers in a 
  prescribed way when the procedure is called, and later restored from those 
  registers when the procedure returns. A register frame procedure is colloquially called 
  “light weight” because it entails less overhead but supports fewer features. 

• Null frame procedures. The called routine shares and fully respects the caller’s 
  context. A null frame procedure imposes the least overhead but provides the 
  fewest capabilities. 


Null frame procedures resemble our manually produced NUMOUT local subroutine. 
Both null frame and register frame procedures are best suited for “leaf” or “end of the 
calling chain” procedures. Unless procedures of these two types contain ad hoc saving 
and restoring of intermediate results that scratch registers may contain, they are 
unsuitable for applications involving recursion. Stack frame procedures, which are more 
obviously designed to support recursion, will occupy most of our attention. 


Procedure Descriptors 


The base register addressing used in the Alpha architecture, with only a 16-bit displace- 
ment field in load and store instructions, poses a potential problem. How can a program 
or routine “see” or “reach” stored items such as constants or important pointers to its 
own static data structures wherever the linker may have located them? Clearly at least 
one master pointer is needed. By convention, register R27 is designated to hold the 
procedure value, which uniquely distinguishes one procedure from all others. 


For Digital Unix and Windows NT for Alpha the procedure value is the address of a 
procedure’s entry code. The caller must have set this address into register R27 right before 
using a jsr instruction to call the procedure. Through a combination of actions, the 
compiler or assembler, linker, and loader also establish a procedure descriptor which is of use 
to the system in processing errors or other anomalous occurrences, but a procedure does 
not generally need access to its own procedure descriptor in those environments. 


For OpenVMS, the procedure value is not the entry address, but rather the address 
of a procedure descriptor that is contained within a memory region associated with 
each procedure called its linkage section. OpenVMS procedure descriptors have vari- 
able lengths depending on the procedure type and other factors. At a minimum, the pro- 
cedure descriptor is 16, 24, or 32 bytes in size for null frame, register frame, or stack 
frame procedures, respectively. The first quadword (i.e., first 8 bytes) contains numer- 
ous flags and other characteristics describing the procedure and its type. The second 
quadword is an address pointer to the entry point in the code section for the procedure. 
One way that an OpenVMS routine could be called is thus as follows: 


        ldq     R27,X_proc_descr    ; R27 -> proc descriptor for x
        ldq     R26,8(R27)          ; R26 -> code for x
        jsr     R26,(R26)           ; Call x


The sequence actually generated by the $call macro will typically be based on different 
addressing information in the linkage section that requires the same number of 
instructions, but improves typical execution efficiency. The called routine can again use 
the address in register R27 (i.e., its procedure value in the OpenVMS sense) as a base 
register to assist in locating the linkage section and other data, as in all of our illustra- 
tive programs. 
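
The two meanings of "procedure value" can be contrasted in a short Python sketch. The function name entry_address, the toy memory, and the addresses are hypothetical; the flags quadword of the descriptor is left at zero because its contents are not modeled.

```python
def entry_address(procedure_value, memory, environment):
    """Return the address of the first instruction of a procedure."""
    if environment == "openvms":
        # PV points at the procedure descriptor; the entry address is stored
        # in the second quadword, 8 bytes past the start of the descriptor.
        return memory[procedure_value + 8]
    if environment in ("unix", "nt"):
        return procedure_value          # PV is the entry address itself
    raise ValueError("unknown environment")


memory = {
    0x2000: 0,                          # first quadword: flags (not modeled)
    0x2008: 0x10400,                    # second quadword: entry address
}

print(hex(entry_address(0x2000, memory, "openvms")))
print(hex(entry_address(0x10400, memory, "unix")))
```

The extra memory access in the OpenVMS case corresponds to the ldq R26,8(R27) instruction in the calling sequence shown above.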

Contrast this calling sequence with a Unix calling sequence, for instance the calling 
of chrget in the IO_C program. We can use /ni and /xx display modes of dbx 
to learn what actual machine instructions were produced for the jsr instruction that we 
wrote ourselves. Those actual instructions are: 


        ldq     $27,-32664($gp)     # Get address of chrget
        jsr     $26,($27)           # Do the actual subroutine call


We see that the assembler loaded register R27 with the entry point address of the routine 
(i.e., the procedure value in the Unix sense), which had been stored by the assembler and 
linker at a particular offset from the global pointer. Thus the called routine will inherit a 
copy of its own entry point address in that register, which it can then use with the ldgp 
pseudo-instruction, as in all of our illustrative programs to load the $gp register. 
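
The address arithmetic behind the ldq instruction in this example can be checked with a few lines of Python. This is a sketch: the function name effective_address and the value chosen for $gp are hypothetical, since the actual global pointer value is assigned by the linker.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def effective_address(base, displacement):
    """Base register plus a signed 16-bit displacement, modulo 2**64."""
    if not -32768 <= displacement <= 32767:
        raise ValueError("displacement does not fit in 16 bits")
    return (base + displacement) & MASK64


gp = 0x140008000                        # hypothetical $gp value
print(hex(effective_address(gp, -32664)))
```

A displacement of -32664 lies near the lower end of the signed 16-bit range, which suggests that the global pointer is set near the middle of the area that it must reach, so that both negative and positive displacements can be used.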

The information in a procedure descriptor can be used by run-time components 
associated with high-level languages for error handling and “traceback” when 
unrecoverable faults occur. These make possible the production of a cascade of messages say- 
ing that a problem was encountered at some location in subroutine Z, called from some 
location in subroutine Y, called from some location in subroutine X, called from some 
location in MAIN. If all of these routines use the stack properly, there is sufficient infor- 
mation stored on the stack and in the procedure descriptors of MAIN, X, Y, and Z for 
the high-level language error handlers or those in the operating system itself to 
“unwind” the nested procedure calls all the way back to MAIN or to the operating sys- 
tem. Explaining how to do this would lie well beyond the scope of this book. If you are 
interested in looking into this topic in greater depth, see the manual appropriate to the 
calling standard for the particular programming environment. 


How Programs Are Started by Operating Systems 


Full programs are considered to be “procedures” in nearly all respects by an operating 
system itself. When we ask the system to run a program, a loader first copies at least 
some header information from disk storage into memory. With virtual addressing and 
paging memory systems now almost universal, it is not strictly correct to say that the 
entirety of a program is moved into memory all at once. Rather, various pages or chunks 
of the program are brought into memory as needed, and in that way the entire program 
(on disk) can be larger than the amount of available physical memory. 

The system puts the program’s procedure value into register R27, puts an appro- 
priate return address into register R26, and transfers control to the start address of main 
(for Unix) or to whatever address was specified symbolically on the .end statement 
(for OpenVMS) by putting the corresponding virtual address into the program counter. 


This similarity between a main program and a procedure means that the experi- 
ence you have already gained in writing simple programs will readily transfer to com- 
prehending how to write procedures and functions. 


Data Section Pointer (OpenVMS) 


When we use the $routine macro (see below) to define a program or procedure, 
three memory regions are produced: a code section, a data section, and the linkage 
section. Symbols $cs, $ds, and $ls have values equating to the origins of these program 
sections as ultimately assigned by the linker. The procedure descriptor is physically 
located in the linkage section. In all our examples, the linkage section also contains a 
full 64-bit address pointer to the data section at symbolic location $dp because we 
always specify data_section_pointer = true in our use of the $routine 
macro (see discussion below associated with Table 7.4). 

The important point is that register R27 not only points directly to the procedure 
descriptor, but also serves as a convenient base register for accessing all other sorts of 
information stored nearby in the linkage section, of which the procedure descriptor is 
just one part. For our purposes, the chief focus is to be able to address the information 
stored in the data section. All of our programs have begun with the sequence: 


        $code_section
        .base   R27,$ls         ; R27 -> linkage section
        ldq     R15,$dp         ; R15 -> data section
        .base   R15,$ds         ; Tell MACRO about this


In effect, we instruct the assembler to use register R27 as the base register for locating 
the address of the data section. That address is stored for us by the linker at symbolic 
location $dp in the linkage section because we specified data_section_pointer 
= true for the $routine macro. The assembler can develop the appropriate 
displacement addressing for the ldq instruction, namely ldq R15,<$dp-$ls>(R27): 


$linkage_section                    $data_section
$ls:    ...                         $ds:

$dp:    .address $ds                label:


After we have loaded the address of the data section into register R15, we inform the 
assembler that this register can be used as a base register for accessing the whole data 
section (up to 32K in one direction). We have chosen register R15 because it is 
guaranteed to be a saved register (see Table 7.2). 

References to data at specific labels within the data section will then be generated 
with displacement addressing in the form <label-$ds> (R15). So long as a pro- 
gram’s data section is of moderate size, the span of Alpha displacement addressing per- 
mits the assembler to build an address of the form disp (R15) referring to any 
information unit. Otherwise, the ldah instruction (Chapter 4) can be used to load 
address pointers that are multiples of 65,536 (somewhat resembling the use of base reg- 
ister addressing with the Intel 80x86 and Pentium architectures). As all of the example 
programs in this book will be rather modest in size, we will not have need to develop a 
scheme for dealing with large amounts of data storage. 
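
Although we shall not need it, the arithmetic behind such a scheme is brief. The Python sketch below splits a large offset into a high part, suitable for ldah (which scales its displacement by 65,536), and a signed low part, suitable for ordinary displacement addressing. The function name split_offset is hypothetical.

```python
def split_offset(offset):
    """Split an offset into (high, low) with offset == high * 65536 + low."""
    low = ((offset & 0xFFFF) ^ 0x8000) - 0x8000     # sign-extend the low 16 bits
    high = (offset - low) >> 16                     # exact multiple of 65,536
    if not -32768 <= high <= 32767:
        raise ValueError("offset needs more than two 16-bit displacements")
    return high, low


for offset in (0x12345, 40000, 0x18000, -1):
    print(offset, split_offset(offset))
```

Because the low part is signed, an offset such as 40000 is expressed as 1 x 65,536 - 25,536 rather than 0 x 65,536 + 40,000; the high part is rounded up whenever the low 16 bits would otherwise be read as a negative displacement.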


Global Pointer (Unix) 


The Unix assembler and linker place necessary address pointers to support symbolic 
addressing into the data section. The global pointer register ($gp) can typically “reach” 
such pointers using displacement mode. Then actual lda, ldq, or jsr instructions are 
treated as “macros” and are expanded, if necessary, into two steps. In the first step, a 
machine instruction using disp ($gp) loads the pointer either into the target register 
or a scratch register. In a second step, indirect addressing (zero displacement) is used 
with that register to accomplish the desired operation. 

The global pointer is not a saved register (Table 7.2). The linker may not be able to 
compose the data section in such a way that all routines would be able to use the same 
value in the $gp register. Accordingly, a calling routine must use one of three tech- 
niques for resetting its own global pointer value: 


• reload $gp using an ldgp instruction; 

• move $gp into one of the registers guaranteed to be saved across a call (Table 7.2) 
  and then restore it; or 

• allocate a temporary “variable” as part of the stack frame (described later in this 
  chapter) to save/restore $gp. 


The vendor manuals illustrate the first of these methods, which we are using consis- 
tently in our sample programs. 


Standard Prologue and Epilogue 


Properly written Alpha programs and procedures always begin and end in well-defined 
ways called the standard prologue and standard epilogue. These prescribed patterns of 
instructions and assembler directives support not only normal calls and returns, but also 
the stringent requirements for debugging and error recovery by the system software. 
The prologue and epilogue are typically formatted using different assembler capabili- 
ties for each programming environment. You have already seen that the tops and bot- 
toms of our sample programs seem quite different for OpenVMS and Unix variants, 
while the central parts are very similar. Actually, the net effects of the different assem- 
bler directives and macros produce somewhat similar prologues and epilogues. 


The $routine, $return, and $end_routine Macros (OpenVMS) 


We hope that you have noticed—and have been curious about—the fact that the appar- 
ent line numbers in a listing file produced by MACRO-64 take a jump when some of 
the system-supplied macros are used (see Figures 1.2 and 3.3). The biggest jump (over 
2000 lines) occurs with $routine. Unlike the VAX and PDP-11 assemblers, 
MACRO-64 counts not only the physical lines in a source .M64 file but also all lines in 
system-supplied macros that have to be processed during the assembly process. Some 
of those system-supplied macros are very complex, and they use advanced techniques 
that we will not fully treat in our own chapter on using macros. 


The $routine macro, which also involves the internal use of certain other 
system-supplied macros, is designed to assist the assembly language programmer with 
setting up any of the three types of procedures. It also has the capability to set up routines 
with VAX compatibility, but we will not discuss that complicating aspect. In all, 
$routine has more than 30 possible parameters, some of which have simple values 
(e.g., true/false) while others may entail lists. A majority of the parameters are 
optional or have defaults that specify the most useful behavior. Table 7.4 describes all of 
the parameters that we have used thus far in this book, as well as a few others that may 
be helpful to you. 


Table 7.4  Parameters for the $routine Macro

Parameter              Default          Description

name                   none             Name of the routine (required). The "name="
                                        can be omitted if the routine name is given as
                                        the first parameter.

data_section_pointer   false            When true, a pointer to the data section is
                                        inserted in the linkage section and $dp is
                                        defined as the address of that pointer.

kind                   null             Type of routine: bound (not discussed in this
                                        book), stack, register, or null.

save_fp                none             Name of register to which FP (R29) is copied in
                                        the prologue and from which FP is restored in
                                        the epilogue. Only required or valid for
                                        register routines.

save_ra                R26              Name of register containing the return address,
                                        if not R26. Only required or valid for register
                                        routines.

standard_prologue      true             Generates a standard instruction prologue
                                        sequence including register saving and stack
                                        management. Valid only for stack and register
                                        routines.

saved_regs             FP (R29) for     List of registers to be saved on the stack (stack
                       stack routine    routines only). Registers R0, R1, R28, R30 (SP),
                                        and R31 should not be included. Note that
                                        $routine always saves the return address and
                                        the frame pointer FP (R29).


We also hope that you have noticed, and have been curious about, the fact that the
first apparent instruction of a program is not aligned at a zero value of the location counter
shown in an assembly listing for the code section (look again at Figures 1.2 and 3.3). The
reason is that, by default for stack routines, $routine produces a standard prologue ahead
of that instruction, and $return then produces a standard epilogue. For the example of
SQUARES2, these are as follows:


     lda     SP,-48(SP)      ; Allocate fixed-stack area
     stq     R27,0(SP)       ; Save procedure value
     stq     R26,8(SP)       ; Save return address
     stq     R2,16(SP)       ; Save designated register
     stq     R15,24(SP)      ; Save designated register
     stq     FP,32(SP)       ; Save caller's FP
     mov     SP,FP           ; Establish current frame

     <body of the routine>

     mov     FP,SP           ; Release any variable-stack area
     ldq     R28,8(SP)       ; Return address to scratch register
     ldq     R2,16(SP)       ; Restore designated register
     ldq     R15,24(SP)      ; Restore designated register
     ldq     FP,32(SP)       ; Restore caller's FP
     lda     SP,48(SP)       ; Release fixed-stack area
     ret     R31,(R28)       ; Return to caller


The standard epilogue uses scratch register R28 for returning to the caller. Registers
R26 and R27 are not preserved and are, in fact, available within the body of the routine
as scratch registers (see Table 7.2).

If the logic of a routine has multiple points at which it might exit, $return may
be used at those several places. It is not necessary to branch to a single joining point.
The only cost would be the rather small amount of memory taken up by the identical
copies of the standard epilogue sequence.

By contrast, a single occurrence of the $end_routine macro is needed as a
syntactic marker for the end of the scope of a particular $routine macro (see Figure
7.2). This requirement parallels the role of .end as a marker for the end of an entire
assembly unit.


Prologue and Epilogue for Unix 


The Unix assembler is not accompanied by comprehensive macros like those that
MACRO-64 affords in the OpenVMS programming environment. Instead, the vendor
manuals outline a stepwise “cookbook” for constructing prologues and epilogues:

1. Use the .ent directive and a matching entry label for the procedure.

2. Load the $gp register.

3. Allocate stack space for all purposes (saving registers, seventh and additional
arguments, local variables, etc.) just once using a mandatory multiple of 16 bytes.

4. Include a .frame directive.

5. Save the appropriate registers and give summary bit masks in .mask (integer)
and .fmask (floating-point) directives. Use only stq and stt instructions.

6. Mark the end of the prologue with a .prologue directive.

The epilogue proceeds in reverse:

1. Restore registers. Use only ldq and ldt instructions.

2. Get the return address.

3. Release stack space.

4. Put an exit code in register R0 (not always required).

5. Provide a ret instruction.

6. Conclude with a .end directive.


In our programs, we will use several symbolic values to compute the total size of the 
stack allocation. We introduced STACK and FRAME with the IO_C program (Figure 
6.7b). Others will be introduced when we discuss local variables below. 
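
The summary bit mask called for in step 5 of the prologue list can be worked out by hand, but a small calculation makes the pattern easier to see. The following Python sketch is only an illustration and is not part of any assembler toolchain. It assumes that bit n of the integer mask stands for register Rn, an assumption that is consistent with the value 0x04000600 that our Unix version of RANDFUNC gives for saved registers R9, R10, and R26. The function name register_mask is hypothetical.

```python
def register_mask(*regs):
    """Return a 32-bit mask with bit n set for each saved register Rn.

    Assumption (not taken from the vendor manuals): bit n of the mask
    corresponds to integer register Rn.
    """
    mask = 0
    for n in regs:
        if not 0 <= n <= 31:
            raise ValueError("register number must be in the range 0..31")
        mask |= 1 << n
    return mask


if __name__ == "__main__":
    # Registers R9, R10, and R26, as saved by the Unix RANDFUNC example
    print(f"0x{register_mask(9, 10, 26):08x}")
```

A mask built this way can then be compared against the constant written in the .mask directive as a quick consistency check.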
Frame Pointer and Local Variables 


High-level language compilers generally distinguish between global variables and local 
variables. While the storage for global variables should be unique, the anticipation of 
recursion requires that storage for local variables be provided afresh for each instance, 
or call, of a procedure. When a programming environment provides for a suitably large 
addressing range for the stack, such storage can be allocated at the beginning of a rou- 
tine by claiming several (or even a great many) stack locations. 


Stack Organization for OpenVMS 


The designation of one register as a frame pointer (R29 for Alpha or R13 for VAX)
leads to a convenient scheme for such local variables. The total space required for N
quadwords can be claimed from the stack with a single lda instruction all at once, i.e.,
right after the standard prologue:


     lda     SP,-<8*N>(SP)   ; Claim space for N quadwords


Then these N quadwords can be addressed individually using symbols defined as positive
offsets from SP, the stack pointer, e.g., <8*V>(SP) with V = 0, 1, 2. Equally well, they
can be addressed using symbols defined as negative offsets from FP, the frame pointer,
e.g., -<8*V>(FP) with V = 1, 2, 3. This latter method is convenient because the
procedure may then freely use the stack for other incidental purposes (OpenVMS only), while
the procedure-wide local variables will maintain fixed addressing relationships to FP.

The overall appearance of the stack for an OpenVMS stack procedure with a
standard prologue which has just claimed space for three local variables will have this
schematic organization:


     [Stack diagram garbled in the source. Labels that survive, from higher to
     lower addresses: SIZE(FP); RSA_OFFSET(FP); further offsets from FP; and
     -24(FP), which is 0(SP) right now and x(SP) later.]


We have drawn this sketch with address values increasing upward. Similar diagrams in
the vendor manuals may be drawn upside down relative to this. Any incoming
arguments beyond the first six, which are always passed in registers, will have been placed
on the stack by the caller (the $call macro does this). Note that $routine creates a
symbolic offset SIZE in order that a called routine be able to set some register to point
to the extended argument list. Furthermore, the routine does not explicitly have to
release the space claimed for its three local variables with a separate lda instruction, if
there is a standard epilogue produced by $return. The epilogue already contains as
its first instruction a resetting of the stack pointer (using the frame pointer).

Any lists of seventh or further arguments for calling other procedures from this 
procedure occur as part of the “incidental stack usage” region on this diagram. 

Important: The value of the frame pointer must never change throughout the 
whole course of execution of an OpenVMS procedure. The frame pointer defines the 
context of the procedure by pointing to its local storage, its incoming arguments beyond 
the first six, its own procedure value, and the caller’s return address. 


Stack Organization for Unix 


The designation of one register (R15) as a frame pointer is optional in the Unix pro- 
gramming environment, but we will do so because it leads to a convenient addressing 
scheme for local variables that will permit us to maintain a close parallel between the 
OpenVMS and Unix variants of our program examples. Note that register R15 is 
marked “Yes” in Table 7.2 and is thus preserved across procedure calls. 

In order to bridge between our OpenVMS examples and the discussions in the 
vendor manuals for Unix, it will be helpful to introduce some symbolic assembler val- 
ues for a particular program routine, as follows: 


     VARS   = n1                       # number of local variables
     REGS   = n2                       # number of saved registers (including R26)
     ARGS   = n3                       # maximum number of arguments (beyond 6)
     STACK  = VARS + REGS + ARGS       # total quadwords
     FRAME  = ((STACK*8+8)/16)*16      # bytes (multiple of 16)
     OFFSET = -FRAME + 8*ARGS          # (negative) offset to saved reg


and to locate the two instructions 


     lda     $sp,-FRAME($sp)           # Allocate stack space
     lda     $fp,FRAME($sp)            # Define a frame pointer


in the prologue and just after the prologue, respectively. 
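
The arithmetic behind these symbolic values is easy to check outside the assembler. The Python sketch below is our own illustration, with the hypothetical function name frame_layout. It mirrors the three computed symbols and assumes that the assembler's division truncates, which for these non-negative quantities is the same as Python's floor division.

```python
def frame_layout(n_vars, n_regs, n_args):
    """Mirror the STACK, FRAME, and OFFSET symbols for one routine.

    n_vars: number of local variables (VARS)
    n_regs: number of saved registers, including R26 (REGS)
    n_args: maximum number of outgoing arguments beyond six (ARGS)
    """
    stack = n_vars + n_regs + n_args          # total quadwords
    frame = ((stack * 8 + 8) // 16) * 16      # bytes, a multiple of 16
    offset = -frame + 8 * n_args              # (negative) offset to saved reg
    return stack, frame, offset


if __name__ == "__main__":
    # Three saved registers, no locals, no extra arguments
    print(frame_layout(0, 3, 0))
    # Three locals, three saved registers, two extra arguments
    print(frame_layout(3, 3, 2))
```

Adding 8 before dividing by 16 is what rounds an odd number of quadwords up to the next multiple of 16 bytes, while an even number of quadwords is left unchanged.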
Then the quadwords for the local variables can be addressed individually
using symbols defined as negative offsets from FP, the frame pointer, e.g.,
-<8*V>(FP) with V = 1, 2, 3. That is, any particular local variable has a fixed
addressing relationship to FP. The compiler for a high-level language may instead
use a known positive offset from SP to achieve the same result.


The overall appearance of the stack for a Unix stack procedure with a standard 
prologue which has just claimed space for three local variables will have this schematic 
organization: 










     Arguments (beyond the first six)    : FRAME(SP) or : 0(FP)

     Local variable 1                    : -8(FP)
     Local variable 2                    : -16(FP)
     Local variable 3                    : -24(FP)
     ...more variables or registers?
     Other saved registers               : 8*ARGS+8(SP) etc.
     Saved return address                : 8*ARGS(SP)
     ...more arguments?
     Argument build area                 : 0(SP) etc.


We have drawn this sketch with address values increasing upward. 


The maximal extent of any lists of seventh or further arguments for calling other 
procedures from this procedure must be allocated for the “argument build” region on 
this diagram. 


Integer Division on the Alpha 


We have previously remarked (Chapter 4) that the RISC design for the Alpha architec- 
ture lacks any hardware divide instruction for integer data. We presented a partial rem- 
edy using multiplication by the reciprocal of a number (Chapter 5), but that works only 
case by case where the divisor is essentially known at the time an assembly language 
routine is being written. The completely general case obviously requires better compen- 
sation through system-supplied software for what is missing at the hardware level. 


Many interesting algorithms, including the linear congruential method for
producing somewhat randomly drawn integers, require a modulo, remainder, or division
operation. How can such operations be accomplished on a computer lacking a machine
instruction for integer division?

The accommodations provided in the OpenVMS and Unix programming
environments are different. Insofar as possible, however, we are going to allocate registers quite
deliberately for any of our future example programs that may involve quotients or
remainders, in order to minimize the number of visual differences between variants of
those programs.


Routines Related to Division (OpenVMS) 


The OpenVMS system software includes many groups of system routines that have 
names of the form grp$name and that conform to the standard calling convention. 
Most of those routines are described in various programmer-oriented manuals. Usually 
each argument must be passed in only one prescribed way. 


Some routines that are only sketchily mentioned in the vendor manuals are listed
in Table 7.5 because they can compensate for the hypothetical “divq” machine
instruction. We are about to illustrate the use of ots$rem_ul. This routine behaves as
a function, taking two arguments passed by value in registers R16 and R17 and
returning a result in register R0.


Table 7.5 Integer Divide and Remainder Routines (OpenVMS)

     Name          Quadword Value Returned in Register R0

     ots$div_l     Quotient from arg1/arg2 (signed quantities)
     ots$div_ul    Quotient from arg1/arg2 (unsigned quantities)
     ots$rem_l     Remainder from arg1/arg2 (signed quantities)
     ots$rem_ul    Remainder from arg1/arg2 (unsigned quantities)


These routines operate with 64-bit (quadword) arguments passed by value and return 64-bit results. Similar
routines with names ending with _i and _ui operate with longword values passed physically as
sign-extended and zero-extended 64-bit quantities, respectively.


Pseudo-instructions for Division and Remainders (Unix) 


The Unix assembler recognizes a larger number of pseudo-instructions than
MACRO-64 for OpenVMS. Those include a full set of instructions that support division and the
calculation of remainders for longword and quadword operands:


     divtx   Ra,Rb,Rc        # Rc <- Ra/Rb
     divtx   Ra,lit,Rc       # Rc <- Ra/lit
     remtx   Ra,Rb,Rc        # Rc = remainder from Ra/Rb
     remtx   Ra,lit,Rc       # Rc = remainder from Ra/lit


where the fourth character of the opcode denotes the size of the information unit, t = l
for longword operands and t = q for quadword operands, and where x is null for
operations with signed quantities and x = u for operations with unsigned quantities.


These pseudo-instructions generate short instruction sequences that include setup 
and special calls to system-supplied support routines. It is essential to know that these 
calls use several temporary registers, namely R23, R24, R25, R27, and R28. Any previ- 
ous information in those particular registers will be overwritten and lost. 


RANDOM: A Callable Procedure or Function 


We will present an example of linking an external routine written in assembly language 
to a main program written in a high-level language, first as a procedure and then as a 
function. This routine, which produces successive “random” numbers, will also illus- 
trate calling service routines for computing the remainder from integer division. 


High-level languages have not always included an internal mathematical routine 
for producing pseudo-random values needed by applications such as Monte Carlo simu- 
lations. Vendors typically do now include such a routine as an extension when the stan- 
dard for a language lacks it. Nevertheless, a great deal has been written on the 
challenging topic of how to generate and test sequences of random numbers (see refer- 
ences: Ferrenberg, Hayes, Knuth, Park). 


Knuth has given guidelines for selecting the constants for the linear congruential 
method, in which successive values are generated by the simple relationship: 


     $$R_n = (m \cdot R_{n-1} + a) \bmod d$$


where m is a “well-chosen” constant multiplier, a is a constant additive term, and d is a
constant divisor. On a binary computer, d is usually selected as $2^w$, where w is the
natural integer size in bits, because the natural retention of only a fixed number of least
significant bits from multiplication and addition is tantamount to a modulo operation. If d
is a power of 2, then 1 is a suitable value for a. In addition, Knuth has shown that m
should end with x21 (decimal), with x being an even digit.


The “seed” value $R_0$ is arbitrary. If it is a built-in constant, the routine will always
produce the same sequence. Such repeatability of the sequence is sometimes useful in 
debugging or studying simulation algorithms, and that is one reason why people do not 
always want to use “black box” random-number generators provided with high-level 
languages. We will derive a seed value from the date and time obtained from the operat- 
ing system, thereby ensuring a different sequence each time the routine presented below 
is called. 
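
The linear congruential recurrence and the repeatability that follows from a fixed seed can be tried out in a few lines. The Python sketch below is an illustration only. It assumes w = 64, so that d is 2 to the power 64, takes a = 1, and borrows the multiplier that appears in our RANDOM listings. Because Python integers do not wrap around, the retention of the low 64 bits is written out explicitly with a mask. The names lcg_next and sequence are hypothetical.

```python
MASK64 = (1 << 64) - 1            # keeps the low 64 bits: "modulo 2**64"
MULT = 3141592653589793221        # multiplier used in the RANDOM listings


def lcg_next(seed):
    """One step of R(n) = (m * R(n-1) + a) modulo d, with a = 1, d = 2**64."""
    return (seed * MULT + 1) & MASK64


def sequence(seed, count):
    """Return the next `count` values that follow the given seed."""
    values = []
    for _ in range(count):
        seed = lcg_next(seed)
        values.append(seed)
    return values


if __name__ == "__main__":
    # The same built-in seed always reproduces the same sequence
    print(sequence(12345, 3) == sequence(12345, 3))
    # The multiplier ends in x21 with x an even digit
    print(MULT % 1000)
```

On a 64-bit machine the mask is unnecessary, because a 64-bit multiply and add discard the high-order bits on their own.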
RANDPROC: the Procedure Variant of RANDOM (OpenVMS) 


RANDPROC, the procedure variant of RANDOM, is shown in Figure 7.5. This routine 
produces a pseudo-random number within a desired [low, high] range. A third argument is 
used as an address pointer for storing the result at a location predetermined by the caller. 


        .title  RANDPROC Random Number Generator (OpenVMS)

; Procedure RANDPROC(LOW,HIGH,RESULT) returns a randomly
; generated integer between LOW and HIGH as RESULT.
; The random integer is generated by the multiplier/adder
; method with "MOD" used to bring the integer into the range
; [LOW --> HIGH]. Seed is initialized using the system time.

; Register use (procedures never save R0,R1):
;       R0  - Future use
;       R1  - Temporary use
;       R9  - LOW value
;       R10 - Range = HIGH - LOW + 1
;       R11 - RESULT address
;       R14 - Pointer to linkage section of this routine
;       R15 - Pointer to data section of this routine
;       R16 - LOW (first argument passed by reference)
;       R17 - HIGH (second argument passed by reference)
;       R18 - RESULT (third argument passed by reference)

        $routine randproc, data_section_pointer=true, -
                kind=stack, -
                saved_regs=<r9,r10,r11,r14,r15>
        $data_section
SEED:   .quad   0                       ; Seed for random algorithm
MULT:   .quad   3141592653589793221     ; Multiplier

        $code_section
        mov     r27,r14                 ; Copy linkage section pointer
        .base   r14,$ls                 ; R14 -> linkage section
        ldq     r15,$dp                 ; R15 -> data section
        .base   r15,$ds                 ; Tell MACRO about this
first:: ldq     r9,(r16)                ; LOW value
        ldq     r10,(r17)               ; HIGH value
        mov     r18,r11                 ; RESULT address
        subq    r10,r9,r10              ; HIGH - LOW
        addq    r10,1,r10               ; R10 = range
        ldq     r1,SEED                 ; First time called?
        bne     r1,gen                  ; No, go generate number
        $call   sys$gettim,args=<SEED/a> ; System time
        ldq     r1,SEED                 ; Get first-time seed
gen:    ldq     r22,MULT                ; Get multiplier
        mulq    r1,r22,r1               ; R1 = SEED * MULT
        addq    r1,1,r1                 ; Add one
        stq     r1,SEED                 ; Seed for next call
        $call   ots$rem_ul,args=<R1/q,R10/q>
        addq    r0,r9,r0                ; LOW + (RND mod RANGE)
        stq     r0,(r11)                ; Store the result
done::  mov     1,r0                    ; Tell OpenVMS
        $return                         ;  we're ending normally
        $end_routine randproc           ; Needed by $routine
        .end                            ; Note no start address


Figure 7.5 RANDPROC: Callable procedure version of RANDOM


When a routine is to call another routine, any “inherited” or “incoming”
arguments in registers R16-R21 or F16-F21 must be processed or preserved in other
registers or on the stack as local variables. In RANDPROC, for example, we have copied
into register R9 the data pointed to by the first incoming argument, because the value 
LOW is needed again. We have also processed the data pointed to by the second argu- 
ment in order to derive the scaling range and store that in register R10. Moreover, we 
have saved the third argument (the address where the result is to be returned) in register 
R11. This processing or preserving of arguments has to be done before calling the sys- 
tem routine sys$gettim, since registers R16-R21 will again be used for passing 
“outgoing” arguments. 

The system routine sys$gettim passes back by reference one quadword of data 
representing the date and time as the number of 100-ns units since 00:00 on November 
17, 1858 (the Smithsonian base date and time for the astronomical calendar). The call to 
sys$gettim occurs only when RANDPROC is called the first time. Subsequently, 
each previously generated number is used as the generator (i.e., the “seed”) for the next 
one. Local storage on the stack is volatile, i.e., it is abandoned when a routine returns to 
its caller. Therefore, SEED is stored in the data section rather than on the stack. 

After the newly generated random value has been stored as SEED for the next 
occasion, the copy in register R1 is scaled into the desired range. The unsigned variant 
of the remainder routine from the system library, ots$rem_ul, is used to perform a 
modulo operation using the range as divisor. When quantities are unsigned, the mathe- 
matical “rem” and “mod” operations are identical. Since we had designed the routine to 
pass the result by reference, the stq instruction specifies the destination using the 
assembler syntax (R11). 
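
The scaling step, and the remark that “rem” and “mod” agree for unsigned quantities, can both be checked with a short calculation. The Python sketch below is our own illustration, with the hypothetical names scale_to_range and rem. Python's % operator behaves as a mathematical “mod”, so a truncating remainder is written out by hand for comparison.

```python
def scale_to_range(rnd, low, high):
    """Bring a non-negative value into [low, high]: LOW + (RND mod RANGE)."""
    rng = high - low + 1              # Range = HIGH - LOW + 1
    return low + rnd % rng


def rem(a, b):
    """Truncating remainder: the result takes the sign of the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q


if __name__ == "__main__":
    # Every result lies between 1 and 6 inclusive, like a die
    print(sorted({scale_to_range(r, 1, 6) for r in range(100)}))
    # Non-negative operands: rem and mod agree
    print(rem(7, 3), 7 % 3)
    # Negative dividend: rem and mod differ
    print(rem(-7, 3), -7 % 3)
```

The two operations part company only when an operand is negative, which is why the unsigned variant of the remainder routine can serve as a modulo operation here.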

A suitable calling program written in Pascal for testing RANDPROC is
TESTPROC as shown in Figure 7.6.

PROGRAM testproc(input, output);

{ This Pascal calling program will be linked to the
  random number generator. In this version of the
  program, the random number is returned via the
  third argument of the procedure. }

VAR i,random : unsigned64;
    LOW : unsigned64 VALUE 1; HIGH : unsigned64 VALUE 6;

PROCEDURE randproc(lower, upper: unsigned64;
                   VAR random : unsigned64);
{ In DEC Pascal, parameters are passed by reference. }
{ Here the third parameter receives a return value. }
EXTERNAL; {Actual procedure defined elsewhere. }

BEGIN
  FOR i := 1 TO 20 DO
    BEGIN
      randproc (LOW, HIGH, random);
      writeln (random)
    END
END.

Figure 7.6 TESTPROC: Pascal main program for RANDOM procedure 


Although even rather elementary illustrations of programming in Pascal use 
numerous procedures, these are usually internal. The PROCEDURE keyword, the name 
of the internal procedure, and an argument list in parentheses are followed by a body 
(BEGIN ... END) of actual instructions. Several such procedures may precede a main 
program within a single source file, thus requiring only one compile command and a 
very simple link command. 

Pascal procedures can, alternatively, be maintained at the source level as external 
modules in separate files. In that case, each routine becomes a separate module. In the 
program or module where the routines are invoked, however, the compiler also needs 
certain information about arguments at the very least. In Pascal syntax, this is
accomplished with an “empty” PROCEDURE statement that is immediately followed by the
keyword EXTERNAL, as shown in Figure 7.6.

While TESTPROC is being compiled, the Pascal compiler builds one side of the
argument passing arrangements for an external procedure known only as randproc.
This is done with absolutely no knowledge about what programming language has
actually been used for writing RANDPROC. As long as the other side of the argument
passing arrangements is ensured to be compatible, we could write RANDPROC in Pascal,
C, BASIC, FORTRAN, ..., or assembly language. The only requirement is
conformance to the OpenVMS calling standard.

The actual coding in TESTPROC is unexceptional. In order to take full advantage 
of the precision of 64-bit computing, we have selected the unsigned64 data type. The
procedure is invoked 20 times in a simple loop. Each time, the variable random will 
take on a newly computed value, and that value is written out. If the program is rerun, 
the sequence of 20 values is extremely likely to be different because RANDPROC will 
obtain a different seed value from the system date and time. 

The commands to produce and test the runnable program using the MACRO-64
assembler, the Pascal compiler, and the linker are of the form: 


$ macro/alpha randproc 

$ pascal/nooptimize testproc 
$ link testproc,randproc 

$ run testproc 


In the link command, TESTPROC must be specified first because the linker expects to 
find a suitable starting address in the first object file. Only the “main program” has such 
a transfer address. The source file for RANDPROC does not, and should not, have a 
symbolic address on its .end statement.

We provide an equivalent version of RANDPROC for the Unix programming 
environment on the CD-ROM accompanying this book. The obligatory differences 
from the OpenVMS version correspond quite closely to the analogous situation for the 
function variant of RANDOM, discussed next in both OpenVMS and Unix versions. 


RANDFUNC: the Function Variant of RANDOM (OpenVMS) 


RANDFUNC, the function variant of RANDOM, is shown in Figure 7.7 for OpenVMS 
(and later in Figure 7.9 for Unix). This routine produces a pseudo-random number 
within a desired [low,high] range. When this function is called from a high-level lan- 
guage, the result appears to be associated with the name of the function but is actually 
conveyed “behind the scenes” via register R0.

        .title  RANDFUNC Random Number Generator (OpenVMS)

; Function RANDFUNC(LOW,HIGH) returns a randomly generated
; integer between LOW and HIGH by value in register R0.
; The random integer is generated by the multiplier/adder
; method with "MOD" used to bring the integer into the range
; [LOW --> HIGH]. Seed is initialized using the system time.

; Register use (procedures never save R0,R1):
;       R0  - Returned value
;       R1  - Temporary use
;       R9  - LOW value
;       R10 - Range = HIGH - LOW + 1
;       R14 - Pointer to linkage section of this routine
;       R15 - Pointer to data section of this routine
;       R16 - LOW (first argument passed by reference)
;       R17 - HIGH (second argument passed by reference)

        $routine randfunc, data_section_pointer=true, -
                kind=stack, -
                saved_regs=<r9,r10,r14,r15>
        $data_section
SEED:   .quad   0                       ; Seed for random algorithm
MULT:   .quad   3141592653589793221     ; Multiplier

        $code_section
        mov     r27,r14                 ; Copy linkage section pointer
        .base   r14,$ls                 ; R14 -> linkage section
        ldq     r15,$dp                 ; R15 -> data section
        .base   r15,$ds                 ; Tell MACRO about this
first:  ldq     r9,(r16)                ; LOW value
        ldq     r10,(r17)               ; HIGH value
        subq    r10,r9,r10              ; HIGH - LOW
        addq    r10,1,r10               ; R10 = range
        ldq     r1,SEED                 ; First time called?
        bne     r1,gen                  ; No, go generate number
        $call   sys$gettim,args=<SEED/a> ; System time
        ldq     r1,SEED                 ; Get first-time seed
gen:    ldq     r22,MULT                ; Get multiplier
        mulq    r1,r22,r1               ; R1 = SEED * MULT
        addq    r1,1,r1                 ; Add one
        stq     r1,SEED                 ; Seed for next call
        $call   ots$rem_ul,args=<R1/q,R10/q>
        addq    r0,r9,r0                ; LOW + (RND mod RANGE)
done::                                  ; R0 = computed value
        $return                         ;  to send back
        $end_routine randfunc           ; Needed by $routine
        .end                            ; Note no start address


Figure 7.7 RANDFUNC: Callable function version of RANDOM (OpenVMS)

In order to demonstrate the generality and capability of a fully-defined calling
standard, we have written very brief calling programs for RANDFUNC in several of the
standard high-level languages that are available for OpenVMS (RANDPAS, RANDBAS,
RANDFOR, RANDC, RANDCOB in Figure 7.8). We have attempted to make these Pascal,
BASIC, FORTRAN, C, and COBOL programs as similar to one another in form as the
variations in syntax of these languages would permit. For instance, we have omitted the line
numbers that would have been needed in a program written in original Dartmouth BASIC.





PROGRAM randpas(input, output) ; 


{ This Pascal calling program will be linked to the 
function that generates random numbers. } 


VAR i : unsigned64; 
LOW : unsigned64 VALUE 1; HIGH : unsigned64 VALUE 6; 


FUNCTION randfunc(lower, upper: unsigned64) : unsigned64; 
{ In DEC Pascal, parameters are passed by reference. } 
EXTERNAL; {Actual procedure defined elsewhere. } 
BEGIN 
FOR i := 1 TO 20 DO
BEGIN 
writeln( randfunc (LOW,HIGH) ) 
END 
END. 





PROGRAM randbas

! This BASIC calling program will be linked to the
! function that generates random numbers.

DECLARE LONG I
MAP (PARAM) LONG LOW,Z1, HIGH, Z2
! BASIC lacks direct support of 64-bit integers

LOW = 1 \ HIGH = 6 \ Z1,Z2 = 0
! Simulated quadwords now have upper 32 bits of zero

EXTERNAL INTEGER FUNCTION randfunc (INTEGER, INTEGER)

! In DEC BASIC, scalars are passed by reference
! by default.

FOR I = 1 TO 20
    PRINT randfunc (LOW, HIGH)
NEXT I

END PROGRAM



      PROGRAM randfor

*     This FORTRAN calling program will be linked to the
*     function that generates random numbers.

      INTEGER*8 randfunc, I
      EXTERNAL randfunc

*     In DEC FORTRAN, scalars are passed by reference by default.

      DO 100 I = 1, 20
        PRINT *, randfunc(1, 6)
100   CONTINUE
      END




/* This C calling program will be linked to the 
function that generates random numbers. */ 


#include <stdio.h>
#ifdef __VMS
#include <ints.h>
#endif

main ()
{
#ifdef __VMS
    uint64 randfunc(), i, LOW=1ul, HIGH=6ul;
#else
    unsigned long int randfunc(), i, LOW=1ul, HIGH=6ul;
#endif

/* In DEC C, scalars are passed by value by default, but
   &name forces it to pass an argument by reference. */

    i=0;
    while ( i<20 )
    {
        printf( "%d\n",randfunc(&LOW, &HIGH) );
        i++;
    }
}



IDENTIFICATION DIVISION.
PROGRAM-ID. RANDCOB.

*    This COBOL calling program will be linked to
*    the function that generates random numbers.
*    To compile, use command line
*        $ COBOL/NOOPT/ANSI RANDCOB



DATA DIVISION. 
WORKING-STORAGE SECTION. 
* Note that COMP for 18 digits equates to a quadword. 


01 LOWER PIC 9(18) COMP VALUE 1. 
01 UPPER PIC 9(18) COMP VALUE 6. 
01 RAN-DOM PIC 9(18) COMP. 

01 RAN-OUT PIC. SALS Fa 


PROCEDURE DIVISION. 
MAIN-LOOP. 

PERFORM 20 TIMES 
CALL "RANDFUNC" USING LOWER UPPER GIVING RAN-DOM 
MOVE RAN-DOM TO RAN-OUT 
DISPLAY RAN-OUT 

END-PERFORM.

Figure 7.8 Test programs for RANDFUNC in several high-level languages 


These languages differ in their defaults for argument passing (Table 7.3), and we 
have included a few comments with each program. The main point of these comparisons 
is to show the argument passing and the success with a single version of RANDFUNC 
which is callable from many languages. Once that function has been tested and docu- 
mented, the writer of a calling program needs only to ensure that the arguments are for- 
mulated properly and that any obligatory “external” declarations are furnished to any 
selected language compiler. 

We have used the very simplest method of output afforded by each language, with 
no concern about left or right justification or the presence of leading zeros. That is, we 
are not concerned about the formatted appearance of the output here; rather, we are 
demonstrating equivalent intrinsic functionality. 


RANDFUNC: the Function Variant of RANDOM (Unix) 


An equivalent version of RANDFUNC for the Unix programming environment (Figure 
7.9) requires three sorts of modifications relative to the OpenVMS version. Of course 
the “packaging” differs, chiefly in the expression of prologue and epilogue regions. 
Second, the system routine that returns a useful time value is different. Finally, the man- 
ner of performing a remainder operation now entails a remqu pseudo-instruction that 
envelops an implicit system subroutine call (Unix) in place of an explicitly visible call 
to the ots$rem_ul system subroutine (OpenVMS).

/*  RANDFUNC Random Number Generator (Unix) */

/*  Function RANDFUNC(LOW,HIGH) returns a randomly generated
    integer between LOW and HIGH by value in register R0.
    The random integer is generated by the multiplier/adder
    method with "MOD" used to bring the integer into the range
    [LOW --> HIGH]. Seed is initialized using the system time.

    Register use (procedures never save R0-R8):

        R0  - Returned value
        R1  - Temporary use
        R9  - LOW value
        R10 - Range = HIGH - LOW + 1
        R16 - LOW (first argument passed by reference)
        R17 - HIGH (second argument passed by reference)
        R23-R25,R27,R28 - Used internally by remqu   */

        .data
SEED:   .quad   0                       # Seed for random algorithm
MULT:   .quad   3141592653589793221     # Multiplier
TEN6:   .quad   1000000                 # 1,000,000 (one million)
TIME:                                   # To be accessed as quad!
SECS:   .long   0                       # Seconds
USEC:   .long   0                       # Microseconds
VOID:   .long   0                       # (info not needed)
        .long   0                       # (info not needed)
REGS    = 3                             # Need to save 3 (incl R26)
STACK   = REGS                          # Quadwords needed
FRAME   = ((STACK*8+8)/16)*16
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  randfunc                # These three lines
        .ent    randfunc                #  mark the mandatory
randfunc:                               #  function entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        stq     $9,8($sp)               # Save R9 for the caller
        stq     $10,16($sp)             # Save R10 for the caller
        .mask   0x04000600,-FRAME       # Saved R9,R10,R26
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
first:  ldq     $9,($16)                # LOW value
        ldq     $10,($17)               # HIGH value
        subq    $10,$9,$10              # HIGH - LOW
        addq    $10,1,$10               # R10 = range
        ldq     $1,SEED                 # First time called?
        bne     $1,gen                  # No, go generate number
        lda     $16,TIME                # Address of TIME structure
        lda     $17,VOID                # Address of VOID structure
        jsr     $26,gettimeofday        # System time
        ldgp    $gp,0($26)              # Restore global pointer
        ldq     $0,TIME                 # Two unsigned longwords
        zap     $0,0xF0,$1              # R1 = SEC
        srl     $0,32,$0                # R0 = USEC
        ldq     $2,TEN6                 # R2 = 1,000,000
        mulq    $1,$2,$1                # Seconds * one million
        addq    $1,$0,$1                # Add the odd microseconds
        stq     $1,SEED                 # Store seed
gen:    ldq     $22,MULT                # Get multiplier
        mulq    $1,$22,$1               # R1 = SEED * MULT
        addq    $1,1,$1                 # Add one
        stq     $1,SEED                 # Seed for next call
        remqu   $1,$10,$0               # Remainder from R1/RANGE
        addq    $0,$9,$0                # LOW + (RND mod RANGE)
done:                                   # R0 = computed value
        ldq     $26,0($sp)              # Restore exit address
        ldq     $9,8($sp)               # Restore R9 for the caller
        ldq     $10,16($sp)             # Restore R10 for the caller
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to Unix environment

        .end    randfunc                # Mark end of function


Figure 7.9 RANDFUNC: Callable function version of RANDOM (Unix) 
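
The multiplier/adder method used in RANDFUNC can also be traced outside the Alpha environment. The following Python sketch is only an illustration and is not part of the program in Figure 7.9: the names seed_from_time and next_random are hypothetical, the clock reading is invented, and the 64-bit wraparound of the quadword multiply and add is imitated by masking. It assumes that the remainder is an unsigned 64-bit operation, as with remqu.

```python
MASK64 = (1 << 64) - 1            # registers hold 64 bits, so arithmetic wraps
MULT = 3141592653589793221        # multiplier stored at MULT in the listing


def seed_from_time(secs, usec):
    """Combine seconds and microseconds into one seed (seconds * 10^6 + usec)."""
    return (secs * 1_000_000 + usec) & MASK64


def next_random(seed, low, high):
    """Return (new_seed, value) with low <= value <= high."""
    span = high - low + 1                     # range = HIGH - LOW + 1
    new_seed = (seed * MULT + 1) & MASK64     # multiply, add one, keep 64 bits
    value = low + new_seed % span             # unsigned remainder, then add LOW
    return new_seed, value


# Hypothetical clock reading used only to start the sequence.
seed = seed_from_time(1_000_000, 250_000)
rolls = []
for _ in range(10):
    seed, roll = next_random(seed, 1, 6)      # simulate ten throws of a die
    rolls.append(roll)
print(rolls)
```

Because the new seed is returned rather than hidden in a data section, the sketch makes explicit the state that the assembly language version keeps in SEED between calls.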


Digital Unix provides a function, described on the gettimeofday(2) man 
page and in the <sys/time.h> header file, that returns time-related information. The 
first argument is the address of a structure consisting of two successive longwords to 
receive the number of seconds and microseconds since 00:00 on January 1, 1970 UTC 
(Coordinated Universal Time, formerly Greenwich Mean Time or GMT) passed back 
by reference. The second argument is the address of a second structure consisting of 
two successive longwords to receive other information (which we do not need), also 
passed back by reference. In addition, this function passes back a status code in register 
R0, which we will not use.

We must comment on the use of logical shift instructions in the Unix version of 
RANDFUNC. Why do we not use ldl instructions for obtaining the longword-length
pieces of information sent back by gettimeofday? That information is conceptually
in the form of 32-bit unsigned integers. The Alpha ldl instruction performs sign
extension to a full 64-bit width, and that is not what we want to happen here. Instead, we use
our knowledge that the two 32-bit returned values are adjacent and situated overall at a
natural quadword alignment. Therefore, we use the “trick” of an ldq instruction,
followed by zap and srl instructions which isolate and zero-extend the two independent
32-bit fields very efficiently.
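
The effect of the ldq, zap, and srl sequence can be imitated in a high-level language. The Python sketch below is illustrative only: the helper names zap, ldl, and split_pair are our own, the byte mask 0xF0 is assumed to be the one that clears the upper four bytes, and memory is modeled as a little-endian byte string holding two adjacent longwords.

```python
import struct

MASK64 = (1 << 64) - 1


def zap(value, byte_mask):
    """Clear byte i of a 64-bit value wherever bit i of byte_mask is set."""
    for i in range(8):
        if byte_mask & (1 << i):
            value &= ~(0xFF << (8 * i))
    return value & MASK64


def ldl(memory, offset=0):
    """Longword load with sign extension, shown as a 64-bit pattern."""
    (signed,) = struct.unpack_from("<i", memory, offset)
    return signed & MASK64


def split_pair(memory):
    """One quadword load, then zap and a logical right shift by 32."""
    (quad,) = struct.unpack_from("<Q", memory, 0)   # like ldq
    first = zap(quad, 0xF0)                          # keep bytes 0-3 only
    second = quad >> 32                              # like srl by 32
    return first, second


# Two unsigned longwords; the first has its top bit set on purpose.
memory = struct.pack("<II", 0x9ABCDEF0, 123456)
print([hex(v) for v in split_pair(memory)])
print(hex(ldl(memory)))   # sign extension fills the upper 32 bits with ones
```

The last line shows why a sign-extending longword load is unsuitable here: a value with bit 31 set would arrive with ones in bits <63:32>.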

The commands to produce and test the runnable program, using C as the high-level 
language for the calling program in the Unix programming environment, are of the form: 


> cc -g -O0 -o testfunc randc.c randfunc.s
> testfunc 


Note that the main program (randc.c) contains main while the function (randfunc.s) 
does not. The loader looks for main in order to set the starting address in the PC regis- 
ter for fetching machine instructions. 


The $call Macro (OpenVMS) 


We have given small examples of the external calling capabilities of several high-level 
languages in the OpenVMS programming environment. We will conclude this chapter 
with a recapitulation of the capabilities of Alpha assembly language programs in regard 
to external calling. 

The system-supplied macro $call organizes all of the details for a standard procedure
call. Although this macro recognizes more than a dozen parameters, many of
those relate only to situations involving calls to translated VAX routines (not discussed 
in this book). The principal parameters are described in Table 7.6. 


Table 7.6 Parameters for the $call Macro 


Parameter Default Description 


name none Name of the routine (required). The "name=" can be omitted 
if the routine name is given as the first parameter. 


local false When false, $call generates a call to a globally visible
routine. When true, the call is to a routine that is not
globally visible.


args none List of arguments to pass to the called routine, enclosed 
within angle brackets (< >). Each argument is an address 
expression followed by a qualifier (see Table 7.7) that guides 
the assembler in selecting the appropriate load and store 
instructions to use and in building the argument information 
for register R25. 


scratch_regs R0, R1, Optional list of different registers that $call may use when
F0, F1 processing arguments with the args parameter. At least one
"R" register is needed to process integer or address arguments,
and one "F" register to process floating-point arguments. If
given more registers, $call can work more efficiently.


The $call routine requires a few registers in order to process the argument list.
In general, some form of load instruction moves each argument into a scratch register, 
and then some form of store instruction moves the argument into the proper register 
(R16-R21, F16-F21) or memory location on the stack. The $call macro can use a
more efficient mov pseudo-instruction instead of a load followed by a store if an argu- 
ment is a small constant or is contained in one register and is going to be passed in 
another register. The scratch_regs parameter gives the assembly language pro- 
grammer or a compiler control over which registers are used for that purpose. 

Table 7.7 lists the qualifiers for various argument types, excluding VAX-compatible 
floating-point data types that we do not discuss in this book, and shows the type of load 
and store instructions as well as the code packed into the argument information register 
(R25) for the first six arguments. Note that the store instructions are always 64-bit forms, 
stq and stt.


Table 7.7 Qualifiers for the args Parameter 


Qualifier   Argument Type   Load Instruction   Store Instruction   AI Encoding
/a          Address         lda                stq                 I64 (code 0)
/l          Longword        ldl                stq                 I64 (code 0)
/q          Quadword        ldq                stq                 I64 (code 0)
/s          S-floating      lds                sts                 FS (code 4)
/t          T-floating      ldt                stt                 FT (code 5)


The $call routine evaluates all of its parameters for format and consistency, and
carries out the following actions: 


1. Produces a linkage pair for the called routine in the caller’s linkage section. The 
first quadword of a linkage pair will contain, after linking, the entry point address 
of the called routine. The second quadword of a linkage pair will contain, after 
linking, the procedure value (i.e., the address of the procedure descriptor) of the 
called routine. A temporary label $lp points to this most recently generated
linkage pair.

2. Allocates stack space for arguments (beyond the first six). 

3. Generates instruction sequences that load the arguments into registers or onto the 
stack. 

4. Constructs the bit-encoded value in the argument information register (R25). 

5. Produces the following call instruction sequence as part of a standard prologue. 


ldq R26,$lp       ; R26 -> routine to call
ldq R27,$lp+8     ; R27 -> procedure descriptor
jsr R26,(R26)     ; Call the routine


The complement to the $call macro, which appears in the calling routine, is the
$return macro that we discussed previously, which may appear in one or many
places in the called routine. 


Summary 


A stack provides temporary storage that supplements the general purpose registers, 
and its operation as a last-in first-out (LIFO) dynamic data structure assists in the 
implementation of algorithms that benefit from reversing the order of items. In many 
traditional architectures, the stack pointer participates directly during subroutine call 
and return instructions to help preserve a return address and other context informa- 
tion. In the Alpha architecture, call and return per se do not alter the stack pointer, but 
instead are implemented as jump instructions that preserve, at most, a return address 
in a register. 

A subroutine is called from many different places, operates without explicit 
knowledge of where it was called from or the state of variables in the calling program, 
and returns to the next instruction in sequence in the calling program. In their most gen- 
eral and fully supported form, calling and called routines share only the context infor- 
mation explicitly passed as arguments. Arguments may be passed by value, by 
reference, or by descriptor using registers and stack space. System-supplied macros 
may assist the assembly language programmer in defining several varieties of subrou- 
tines, procedures, or functions and in saving selected registers on the stack. 

Numerous support routines are typically supplied as part of a programming environment,
such as access to the system-maintained date and time or capabilities like division
and remainder operations that supplement the set of machine instructions.


EXERCISES


7.1 Explain whether the addressing convention for the stack for Alpha programming 
could have been reversed. What would be the address of the surface element (i.e., the 
most recently added item) if the stack were built in that other direction, i.e., from low 
address values toward higher address values? 


7.2 Explain why a different choice of flag value on the stack would have to be used in the 
program DECNUM2 if the values of the digits were first stored on the stack and were
not converted into ASCII characters until later, in the STORE section of the program. 
Suggest a suitable flag value. How would the loop control instructions have to be 
altered? Would they be less efficient? 


7.3 Identify all the changes that would be necessary in DECNUM2 if we wanted to use 
the other convention of a stack that builds toward increasing addresses. 


7.4 Adapt DECNUM2 to display its results using chrput. You should store a newline
character at the end of the generated string of digits. 


7.5 What does the following sequence accomplish? 


br Rx,Loc
Loc:


7.6 What is wrong with the following instruction? 
Loc: br Rx, Loc 


7.7 Explain how one might proceed to implement in Alpha assembly language the following
case structure from Pascal:

CASE Char OF
   '0'..'9'            : It_is := 'a numeral';
   'A','E','I','O','U' : It_is := 'a vowel';
   'A'..'Z'            : ; {Empty}
   OTHERWISE           : It_is := 'punctuation';
END


Consult your instructor to inquire whether you are expected to write a complete pro- 
gram, or just the central portion for the case structure. 


7.8 What are the respective advantages and disadvantages of the bsr and jsr instruc- 
tions? 


7.9 What are the restrictions on the use of R16 ... R21 for scratch purposes within a 
given routine if it contains one of the following? 


a. No procedure calls 
b. Calls to other routines 
c. Recursive calls to itself 


7.10 What is the maximum number of arguments that a procedure could have in the Open- 
VMS programming environment on an Alpha system? Why? Is there such a limit in 
the Unix programming environment on an Alpha system? Why or why not? 


7.11 What is the length limitation (if any) for these situations involving strings? 
a. String defined using .asciz (OpenVMS) or .asciiz (Unix) 
b. String defined using .ascic (OpenVMS) 
c. String defined using a standard string descriptor (OpenVMS) 


7.12 Use the debugger to inspect the standard prologue and epilogue for one of your pre- 
viously written programs. Explain the function of every instruction. 


7.13 (OpenVMS only) Explain whether register R27 conveys enough information that a 
routine could be written which could produce a description of itself, somewhat like 
manually using the debugger’s examine/instruction command. 


7.14 Describe all the changes that would be necessary if RANDPROC were to be rede- 
signed to pass arguments by value rather than by reference. Hint: the instruction on 
the line just before the label done would not be stq R0,R11.


7.15 On what (approximate) date will the 32-bit number of seconds since 00:00 on Janu- 
ary 1, 1970, as maintained in all Unix systems, first become negative in the two’s 
complement sense? Will the 64-bit number of 100-ns time units since 00:00 on 
November 17, 1858, as maintained in OpenVMS systems, become negative in the 
two’s complement sense sooner or later than that? 


7.16 (OpenVMS only) Devise dummy routines using the assembler and the debugger to
find out what single instruction or sequence of instructions is developed by the 
$call macro for the following arguments passed in a register as one of the first six 
arguments: 


a. 34 

b. -34 

c. 32767 

d. R5 

e. (R5) 

f. label in data section 


7.17 (OpenVMS only) Repeat exercise 7.16 for each of the arguments passed in memory
on the stack as a seventh or subsequent argument. 


7.18 (Unix only) Devise a satisfactory instruction or sequence of instructions to pass in a 
register as one of the first six arguments each of the quantities listed in exercise 7.16.


7.19 Write an assembly language procedure INTSORT(List,N) that sorts a list of N quad- 
words stored at address List. Assume a relatively small N and use a simple algorithm, 
such as an insertion sort. Pass the arguments to the procedure by reference from a 
main program written in a high-level language of your choice, for testing. 


7.20 Write an assembly language function and test it from a main program written in a 
high-level language of your choice. Pass arguments to the function by reference, and 
pass a quadword result back in register R0. Select one of the following:


a. Function P(N,R) that calculates the number of permutations of N things taken R at 
a time, 


P(N,R) = N * (N-1) * ... * (N-R+1) assuming 0 ≤ R ≤ N.


b. Function GCD(X, Y) that calculates the greatest common divisor of two positive 
quadword integers X and Y using the algorithm 


while X ≠ Y do replace X by |X-Y| and replace Y by MIN(X,Y).

c. Function LOW(List,N) that finds the lowest value in a list of N signed integers. 


d. Function POS_LOW(List,N) that returns the index location (counting from zero) 
where the lowest value occurs in a list of N integers. 


7.21 (Strongly recommended) Incorporate all of the additional Alpha instructions from 
this chapter into your personal summary chart(s). 





CHAPTER 8 


Floating-Point 
Operations 


Students learning science at pre-college and university
levels encounter the usefulness of “scientific notation” for representing numeric values,
especially in situations where the magnitudes vary widely, such as masses ranging
from that of an electron (9 x 10^-31 kg) to that of the planet Earth (6 x 10^+24 kg). Today,
one can purchase a “scientific” calculator that typically deals with numbers spanning
magnitudes from 10^-99 to 10^+99 for only a little more money than a simpler calculating
device that deals only with numbers represented by about 8 decimal digits partitioned by
a decimal point that can only move within the span of those few digits. Indeed, the
proper learning of the quantitative sciences importantly includes the appreciation, for
example, that 6.022137 x 10^23 expresses only what is currently known of the magnitude
of Avogadro’s number, i.e., the number of atoms of isotopically pure carbon-12 having a total
mass of exactly 12 grams. In contrast, the many trailing zeros in the lay person’s repre- 
sentation 602,213,700,000,000,000,000,000 are not “significant” because they are not 
presently known, even though more of them are certainly knowable in principle. 


Early digital computers, like the electromechanical calculating devices in use 
prior to the development of computers, only dealt with data represented as integers (i.e., 
fixed-point numbers). Scientists and engineers who used such calculators or the earliest 
computers therefore had to “scale” their numeric quantities manually, i.e., to track the 
orders of magnitude on the side by themselves. That task is not really difficult, albeit 
sometimes tedious and error-prone. 


With the appearance of FORTRAN, the first high-level programming language for 
scientists and engineers, the computers intended for scientific work came to incorporate
direct architectural support in the instruction set for floating-point numbers, typically
stored as a binary fraction combined with a power of either 2 or 16. Other computer 
architectures (or perhaps other implementations), intended for different uses such as 
business applications, sometimes lacked floating-point machine instructions. Digital 
Equipment Corporation produced integer-only implementations of the PDP-11 archi- 
tecture as well as more powerful models with floating-point support. This distinction 
has continued almost up to the present, in such cases as an IBM PC AT® to which a 
“math coprocessor” could be added or Apple Macintosh computers capable of having 
floating-point support (Motorola 68040 CPU) or not (Motorola 68LC040 CPU). As 
time went on, however, with the design of the most advanced computer architectures 
being highly influenced by the need for computing power in science and engineering, 
floating-point instructions became a standard component of the most powerful architec- 
tures, e.g., with all implementations of the VAX architecture and with many processors 
used in Unix workstations. Even with RISC architectures, the “reduction” in the overall 
extent of instruction sets has invariably left room for floating-point support. 

In this chapter, we outline the IEEE-compliant floating-point instructions of the
Alpha. As mentioned previously (Chapter 2), we have elected not to discuss the VAX- 
compatible floating-point instructions at all. Computer designers today are mindful of the 
IEEE conventions; moreover, the need for backwards compatibility with any manufac- 
turer’s own previous proprietary convention will progressively peter out in the future. 
Thus we feel that anyone just now learning the principles of floating-point manipulations 
should concentrate his or her attention on the rationalized, industry-standard IEEE repre- 
sentation. Furthermore, readers pursuing more of a computer science emphasis than a 
computer engineering emphasis may elect to skim through this chapter, whose details are 
not essential to an understanding of any of the subsequent chapters in our book. 


Parallels between Integer and Floating-Point Instructions 


When one learns elementary mechanics in a physics course, certain parallels are often 
illustrated between the analysis of circular motion and the standard description of linear 
motion: angular momentum and linear momentum, moment of inertia and mass, etc. 
Drawing such parallels can assist the learning process both mnemonically, through the 
similarities, as well as conceptually, through the contrasts. We have already discussed 
integer instructions for RISC architectures in related groupings, specifically for the 
Alpha, in Chapters 4 through 6. Several of those groupings of instructions for data in 
integer formats have direct analogues in instructions for data in floating-point formats, 
as depicted in Table 8.1. The absence of integer division in the Alpha contrasts with the 
presence of floating-point division, as noted in earlier chapters.


Table 8.1 Comparison of Integer and Floating-Point Instructions

Type of Instruction   Integer          Floating-Point
Load/Store            ldl, ldq         lds, ldt
                      stl, stq         sts, stt
Arithmetic            addl, addq       adds, addt
                      subl, subq       subs, subt
                      mull, mulq       muls, mult
                      (none)           divs, divt
Conditional Move      cmoveq           fcmoveq
                      cmovne           fcmovne
                      cmovlt           fcmovlt
                      cmovge           fcmovge
                      cmovle           fcmovle
                      cmovgt           fcmovgt
Signed Compare        cmpeq            cmpteq
                      cmplt            cmptlt
                      cmple            cmptle
Other Comparisons     cmpule, cmpult   (none)
                      (none)           cmptun
Conditional Branch    beq              fbeq
                      bne              fbne
                      blt              fblt
                      bge              fbge
                      ble              fble
                      bgt              fbgt


Other groupings of integer instructions have no floating-point counterparts. Those 
include the logical, shift, and byte-manipulation instructions (and the data-independent 
jump instructions). Conversely, we shall see in this present chapter that a few groupings 
of floating-point instructions have no direct integer counterparts. 


IEEE Special Values 


Before we consider floating-point instructions, we need to provide a bit more detail 
about how the IEEE representation accommodates certain special values in addition to 
ordinary floating-point values. All the various circumstances are outlined in Table 8.2. 
A brief overview of these circumstances will help to make clear some of the side effects
of certain floating-point and conversion instructions.


Table 8.2 Meanings of Special IEEE Floating-Point Representations

Exponent   Fraction   IEEE Meaning           Finite?
All ones   Non-zero   ± NaN                  No
All ones   Zero       ± Infinity             No
Zero       Non-zero   ± Denormal             No
Zero       Zero       ± 0                    Yes
Other      Anything   Non-zero, normalized   Yes


The classification of circumstances depends both on whether the exponent field 
contains all zero bits, all one bits, or any other bit pattern and on whether the fraction 
field contains all zero bits or any other bit pattern. One circumstance is special coding 
for “not a number” (NaN), i.e., a deliberately invalid number stored in an information 
unit. Another circumstance is special coding for positive and negative infinity. A third 
circumstance is special coding for so-called denormalized numbers, i.e., numbers 
whose fractions have not been shifted so as to begin with the hidden bit mentioned in 
Chapter 2. Such numbers have values lying between zero and the smallest normalizable 
numbers given in Table 2.2. The exact value zero is also a special assigned pattern of 31 
or 63 zero bits in addition to a sign bit. Only zero and normalized numbers are consid- 
ered to be mathematically finite. 
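
The classification in Table 8.2 can be expressed as a short test on the exponent and fraction fields. The Python sketch below is an illustration under stated assumptions, not Alpha code: the function names are hypothetical, and it inspects the 64-bit T_floating layout (1 sign bit, 11 exponent bits, 52 fraction bits) as produced by the struct module.

```python
import struct


def t_floating_bits(x):
    """Return the 64-bit IEEE pattern of a Python float as an integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits


def classify(bits):
    """Name the row of the special-values table that a 64-bit pattern falls in."""
    exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)        # 52-bit fraction
    if exponent == 0x7FF:                    # exponent all ones
        return "NaN" if fraction else "Infinity"
    if exponent == 0:                        # exponent all zeros
        return "Denormal" if fraction else "Zero"
    return "Normalized"


for sample in (1.0, -0.0, float("inf"), float("nan"), 5e-324):
    print(sample, classify(t_floating_bits(sample)))
```

The sign bit plays no part in the classification, which is why each special value in the table occurs in both a positive and a negative form.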


Load and Store Floating-Point Instructions 


Like the load and store integer instructions, the load and store floating-point instruc- 
tions provide the only access pathway for data between memory storage and the CPU. 
These data transfer instructions belong to the memory class (Figure 4.1), where 5 bits 
<25:21> designate a floating-point register for the data, 5 bits <20:16> designate an 
integer base register, and 16 bits <15:0> designate a signed byte offset from the base 
register, as summarized in Figure 8.1. Ignoring the VAX-compatible instructions, we 
have pairs of load and store instructions for S_floating and T_floating data (Table 8.3). 


Class     Bits <31:26>   Bits <25:21>   Bits <20:16>   Bits <15:0>
Memory    opcode         Fa             Rb             displacement


Figure 8.1 Format of Alpha floating-point instructions


Table 8.3 Alpha Floating-Point Load and Store Instructions

Mnemonic   Opcode   Purpose
lds        22       Load Fa with S_floating contents found at effective address
ldt        23       Load Fa with T_floating contents found at effective address
sts        26       Store S_floating contents in Fa at effective address
stt        27       Store T_floating contents in Fa at effective address





All of the load and store floating-point instructions use direct addressing for one 
register operand (Fa) and either indirect or displacement addressing for the other oper- 
and using Rb (see Figure 8.1). The operand in memory must be naturally aligned. 


Loading and Storing 64-bit Data 


For the ldt and stt instructions, a 64-bit data value is moved between floating-point
register Fa and the information unit at the effective source or destination address, respec- 
tively. This transfer operation moves all 64 bits in parallel into corresponding bit posi- 
tions, unaltered. The storage conventions for a T_floating datum in memory or in one of 
the 32 floating-point registers are the same (Chapter 2). Indeed, the processor does not 
inspect or interpret the data and does not “know” whether those 64 bits actually represent 
a quadword-length integer, a T_floating number, or 8 arbitrary sequential bytes. 

Since the ldt and stt instructions merely move 64 bits, a quadword-length integer
moved into a floating-point register “looks” just the same as in an integer register,
that is, bit 63 conveys the sign, bit 62 holds the most significant bit, and bit 0 holds the 
least significant bit. We shall see later in this chapter why one might have use for hold- 
ing a quadword-length integer in one of the 32 floating-point registers. 


Loading and Storing 32-bit Data 


In the case of longword integers, we saw (Chapter 4) that loading a 4-byte datum 
resulted in automatic sign extension. In that way, subsequent arithmetic instructions 
would give sensible results regardless of whether those instructions were quadword- or 
longword-oriented. We also saw (Chapter 6) that short sequences involving byte-manip- 
ulations could be used when a longword datum was to be loaded from memory as an 
unsigned integer quantity. In the other direction, we saw that the stl instruction
ignores bits <63:32> in the register and moves only bits <31:0> to a longword-sized 
information unit in memory. 

Similarly, for the lds and sts instructions, a 32-bit datum is adjusted when
moved between a floating-point register and the information unit at the effective source
or destination address. Since the storage conventions for an S_floating datum in memory
or in one of the 32 floating-point registers are different (Chapter 2), the CPU hardware
carries out conversions. For the lds instruction, the following conversions occur:


Memory                                  Register
Bit 31 (sign of number)                 Bit 63
Bit 30 (“sign” of biased exponent)      Bit 62
                                        Bits <61:59> set by algorithm
Bits <29:23> (rest of exponent)         Bits <58:52>
Bits <22:0> (normalized fraction)       Bits <51:29>
                                        Bits <28:0> are zero


The three expansion bits in the exponent field are determined in such a way as to pre- 
serve both normal values and special IEEE flagging values (e.g., “not a number”). The 
algorithm is as follows: 


Memory <30:23>    Register <62:52>
1 1111111         1 111 1111111
1 xxxxxxx         1 000 xxxxxxx   (xxxxxxx not all ones)
0 xxxxxxx         0 111 xxxxxxx   (xxxxxxx not all zeros)
0 0000000         0 000 0000000


For the sts instruction, the following conversions occur in the other direction: 


Register Memory 

Bit 63 (sign of number) Bit 31 

Bit 62 (“sign” of biased exponent) Bit 30 

Bits <61:59> not stored 
Bits <58:52> (rest of exponent) Bits <29:23> 
Bits <51:29> (normalized fraction) Bits <22:0> 
Bits <28:0> not stored 


As a result of these mappings, a longword-length integer datum moved from a floating- 
point register is reconstituted from bits <63:62, 58:29> with bits <61:59> and bits 
<28:0> ignored. 
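
The two mappings just described can be checked with a small model. The Python sketch below is illustrative only and makes no claim about how the hardware is built: the function names lds and sts simply echo the instruction mnemonics, and the register is modeled as a 64-bit integer. For normalized values, zero, and infinity, the expanded pattern should coincide with the T_floating pattern of the same number, which the loop at the end verifies for a few samples.

```python
import struct


def lds(mem32):
    """Expand a 32-bit S_floating memory pattern to the 64-bit register form."""
    sign = (mem32 >> 31) & 1
    exp_hi = (mem32 >> 30) & 1               # memory bit 30
    exp_lo = (mem32 >> 23) & 0x7F            # memory bits <29:23>
    fraction = mem32 & 0x7FFFFF              # memory bits <22:0>
    if exp_hi:
        middle = 0b111 if exp_lo == 0x7F else 0b000
    else:
        middle = 0b000 if exp_lo == 0 else 0b111
    return ((sign << 63) | (exp_hi << 62) | (middle << 59)
            | (exp_lo << 52) | (fraction << 29))


def sts(reg64):
    """Pack register bits <63:62> and <58:29> into a 32-bit memory pattern."""
    return ((reg64 >> 32) & 0xC0000000) | ((reg64 >> 29) & 0x3FFFFFFF)


def single_bits(x):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits


def double_bits(x):
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits


for x in (1.0, -2.5, 0.0, 3.0e38, float("inf")):
    s = single_bits(x)
    (rounded,) = struct.unpack("<f", struct.pack("<f", x))  # value as stored in 32 bits
    assert lds(s) == double_bits(rounded)
    assert sts(lds(s)) == s
print(hex(lds(single_bits(1.0))))
```

Note that sts never looks at bits <61:59> or <28:0>, so the same function also models the transfer of a longword-length integer held in those register positions.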


Addressing of Floating-Point Data in Memory 


In these Alpha memory-reference instructions, the displacement is allocated only 16 
bits. Since this displacement is interpreted as a signed byte address offset, only a range
of data addresses from -32768 to +32767 with respect to register Rb is accessible. If
some data needed at the same time are more widely separated than 64K addressing 
units, more than one base register Rb can be used. Often, several registers will be used 
rather naturally for that purpose within a routine because several logically distinct data 
structures are being accessed. 


In the Alpha architecture the floating-point load instructions (lds and ldt) and
store instructions (sts and stt) all work in the same basic way with respect to calcu- 
lation of the effective address. The signed 16-bit displacement found in bits <15:0> of 
the instruction word is sign-extended to full 64-bit width and added using two’s com- 
plement arithmetic to the value in the register Rb whose name (number) is found in bits 
<20:16> of the instruction word. The numeric result is the effective address submitted 
to the memory storage subsystem of the computer. Of course, the effective address is an 
unsigned quantity. If the value in register Rb were zero and the displacement were -8,
the effective address would actually correspond to the highest possible quadword 
address for the Alpha (i.e., the address space would appear to “wrap around”). 
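
The effective address calculation can be modeled in a few lines. The Python sketch below is only an illustration; the function name effective_address is hypothetical, and 64-bit registers are imitated by masking the sum.

```python
MASK64 = (1 << 64) - 1


def effective_address(rb_value, disp_field):
    """Sign-extend the 16-bit displacement field and add it to Rb, modulo 2**64."""
    disp = disp_field & 0xFFFF
    if disp & 0x8000:                 # bit 15 set means a negative displacement
        disp -= 0x10000
    return (rb_value + disp) & MASK64


print(hex(effective_address(0x1000, 8)))        # a small positive offset
print(hex(effective_address(0, 0xFFF8)))        # displacement -8 from Rb = 0
```

The second call reproduces the wrap-around case mentioned above: the field 0xFFF8 represents -8, and the sum taken as an unsigned 64-bit quantity is the highest quadword address.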


For the load instructions, the contents of the quadword- or longword-length infor- 
mation unit at the effective source address are brought into the central processor. With 
lds, the longword-length datum is itself converted to full 64-bit width before being put 
into the destination register Fa. That way, the value will automatically work with either 
S_floating or T_floating operate instructions without any further conversion. With all 
the load instructions, the destination for the data is the register Fa whose name (num- 
ber) is found in bits <25:21> of the instruction word. The assembler syntax is: 


ldt Fa,disp(Rb)


ldt Fa,source 


where the information unit type t = s for S_floating data or t = t for T_floating data, 
and where disp is a signed displacement to be added to the current address value in 
register Rb or where source is a symbolic address that is within 32K addressing units
from a base register value known to the assembler (OpenVMS) or a symbolic address 
reachable using the global pointer using a two-instruction sequence generated by the 
assembler (Unix). 


For the store instructions, the source for the data is the register Fa whose name 
(number) is found in bits <25:21> of the instruction word. The contents of this register 
(stt) or the longword-length converted contents of this register (sts) are dispatched 
from the central processor for storage at the destination specified by the effective
address. The assembler syntax is:


stt Fa,disp(Rb)
stt Fa,destination


where the information unit type t = s for S_floating data or t = t for T_floating data, 
and where disp is a signed displacement to be added to the current address value in 
register Rb or where destination is a symbolic address that is within 32K addressing
units from a base register value known to the assembler (OpenVMS) or a symbolic
address reachable using the global pointer using a two-instruction sequence generated 
by the assembler (Unix). 

All of these floating-point load and store instructions require that the information 
unit in memory be naturally aligned, that is, the effective address for a T_floating quan- 
tity is exactly divisible by 8 (bits <2:0> in the address must be zero) or for an S_floating 
quantity is exactly divisible by 4 (bits <1:0> in the address must be zero). If the speci- 
fied information unit is not naturally aligned, a hardware exception will occur that the 
operating system will have to process. 
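
The alignment rule amounts to a test on the low-order address bits. The following Python sketch is illustrative: the names is_naturally_aligned and check_access are hypothetical, and the Python exception merely stands in for the hardware exception described above.

```python
def is_naturally_aligned(address, size_in_bytes):
    """True when the low-order address bits are zero for a power-of-two size."""
    return (address & (size_in_bytes - 1)) == 0


def check_access(address, size_in_bytes):
    """Return the address, or raise an error in place of the alignment exception."""
    if not is_naturally_aligned(address, size_in_bytes):
        raise ValueError(f"unaligned {size_in_bytes}-byte access at {address:#x}")
    return address


print(is_naturally_aligned(0x1008, 8))   # T_floating: bits <2:0> are zero
print(is_naturally_aligned(0x1004, 8))   # T_floating: bit 2 is set
print(is_naturally_aligned(0x1004, 4))   # S_floating: bits <1:0> are zero
```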


Floating-Point Arithmetic Instructions 


We elect to describe floating-point arithmetic instructions, which are part of the class of 
operate instructions, as the first systematic discussion of the large group of instructions 
involving non-integer quantities provided in the Alpha instruction set using the floating- 
point registers. 


Addition, Subtraction, Multiplication, and Division 


The Alpha architecture provides all four basic floating-point arithmetic instructions, 
with a syntax closely resembling the integer arithmetic instructions (Chapter 4): 


addt Fa,Fb,Fc     ; Fc <- Fa + Fb
subt Fa,Fb,Fc     ; Fc <- Fa - Fb
mult Fa,Fb,Fc     ; Fc <- Fa * Fb
divt Fa,Fb,Fc     ; Fc <- Fa / Fb


where the fourth character of the opcode denotes the type or size of the information 
unit: t = s for S_floating operands and t = t for T_floating operands. In S_floating 
operations, the result in register Fc has the S_floating register representation and is
thus ready for any follow-on S_floating-point operation involving it. Each operand is 
found by direct addressing in the standard forms just given. That is, the assembler 
will produce an instruction word containing the 5-bit numerical address for each 
named register.

The instructions for floating-point arithmetic share opcode 16. The function codes
in bits <15:5> of these instructions distinguish between S_floating or T_floating oper- 
ands and denote certain other variations of these instructions. The basic forms of these 
instructions are listed in Table 8.4. We mention other related function codes later in this 
chapter. 

Unlike the integer arithmetic instructions, the Alpha floating-point arithmetic 
instructions do not have any alternate provision for a literal operand in place of register 
Fb. This limitation arises because the function code bits are more fully used for special 
qualifiers having to do with issues of rounding and error trapping. 


Table 8.4 Alpha Floating-Point Arithmetic Instructions 


Mnemonic*   Opcode   Function Code   Purpose
adds        16       80              S_floating addition
addt        16       A0              T_floating addition
subs        16       81              S_floating subtraction
subt        16       A1              T_floating subtraction
muls        16       82              S_floating multiplication
mult        16       A2              T_floating multiplication
divs        16       83              S_floating division
divt        16       A3              T_floating division





*Valid combinations of suffixes (Unix) or /qualifiers (OpenVMS) may be added for rounding (c,d,m) and 
trapping (i,s,u). 


The composition or identification of a binary Alpha instruction is best left to com- 
mands of the debugger. Nevertheless, we now give one example of manual assembly. 
Consider the instruction mult F1,F2,F3: 


opcode     Fa       Fb       function code    Fc
16         01       02       0A2              03         (hex fields)
01 0110    0 0001   0 0010   000 1010 0010    0 0011     (bit fields)
0101 1000 0010 0010 0001 0100 0100 0011                  (binary)
5    8    2    2    1    4    4    3                     (hexadecimal)


Some of the other bit positions within the function code range can be used to specify 
such things as the method of rounding. 
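
The manual assembly just shown can be checked mechanically. The following is a minimal sketch in Python rather than Alpha assembly language; the function name encode_fp_operate is our own invention for illustration, and the field positions are those of the bit-field diagram above (opcode in bits <31:26>, Fa in <25:21>, Fb in <20:16>, function code in <15:5>, Fc in <4:0>).

```python
def encode_fp_operate(opcode, fa, fb, function, fc):
    """Pack the five fields of a floating-point operate instruction into 32 bits."""
    if not (0 <= opcode < 1 << 6 and 0 <= function < 1 << 11):
        raise ValueError("opcode is 6 bits wide, function code is 11 bits wide")
    if not all(0 <= reg < 32 for reg in (fa, fb, fc)):
        raise ValueError("register numbers are 5 bits wide")
    # opcode <31:26>, Fa <25:21>, Fb <20:16>, function code <15:5>, Fc <4:0>
    return (opcode << 26) | (fa << 21) | (fb << 16) | (function << 5) | fc


# mult F1,F2,F3 with T_floating operands: opcode 16 (hex), function code 0A2 (hex)
word = encode_fp_operate(0x16, 1, 2, 0x0A2, 3)
print(f"{word:032b}")   # the 32-bit binary instruction word
print(f"{word:08X}")    # the same word in hexadecimal
```

The hexadecimal value printed should agree with the last row of the worked example.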


Rounding 


The IEEE standard for binary floating-point arithmetic specifies a “round to nearest” 
convention as the default behavior, which the Alpha likewise implements as its own 
default. If two nearest representable values are equally near, the one having zero for the
least significant bit of the fraction is chosen. This latter behavior corresponds to a long- 
standing practice among scientists of rounding decimal numbers of the form 0.d...x5 by 
choosing x to be even (sometimes called “unbiased rounding to even”). The everyday 
practice (as used in financial accounting) where the trailing 5 cases are rounded “up” 
can lead to systematic numeric bias when already-rounded quantities are later used to 
compute other derived quantities. Such adverse consequences are taken up in numerical 
analysis books. 
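
The difference between the two tie-breaking rules is easy to demonstrate. The sketch below is in Python, not Alpha assembly language, and relies on two facts about that language: its built-in round() resolves ties to the even neighbor, and its float type is an IEEE double (the same format as T_floating). The sample values are arbitrary choices of ours.

```python
import math

# Ties between two integers: round() picks the even neighbor.
print([round(v) for v in (0.5, 1.5, 2.5, 3.5)])

# Systematic bias: round 0.5, 1.5, ..., 999.5 and add up the rounded values.
values = [k + 0.5 for k in range(1000)]
exact_sum = sum(values)
half_even_sum = sum(round(v) for v in values)            # unbiased rounding to even
half_up_sum = sum(math.floor(v + 0.5) for v in values)   # everyday "round 5 up"
print(exact_sum, half_even_sum, half_up_sum)

# Binary case: the fraction of an IEEE double carries 53 significant bits, so
# the odd integers just above 2**53 lie exactly halfway between two
# representable values and must be rounded on conversion.
big = 2 ** 53
print(float(big + 1) == float(big))        # tie resolved toward the even fraction
print(float(big + 3) == float(big + 4))    # tie resolved toward the even fraction
```

Rounding to even leaves the sum of the thousand values unchanged, while rounding every tie upward inflates it by half a unit per value.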

The IEEE standard also prescribes three additional modes of directed roundings, 
as follows: 


• rounding toward zero, called “chopped rounding” for the Alpha, which is
specified by the c suffix (Unix) or the /c qualifier (OpenVMS) on the name of
instructions where rounding applies;

• rounding toward minus infinity, specified for the Alpha by the m suffix (Unix) or the
/m qualifier (OpenVMS) on the name of instructions where rounding applies; and

• rounding toward plus infinity, which is selectable on the Alpha indirectly through
“dynamic rounding” by first using the d suffix (Unix) or the /d qualifier (OpenVMS)
on the name of instructions where rounding applies. Then special dedicated
instructions can inspect or set various codes in numerous bit fields in a
floating-point control register (FPCR) to control the rounding mode.


We will not discuss these alternative rounding modes or the FPCR and its associated 
instructions in very much detail in this book. 


Exceptions 


The IEEE standard for binary floating-point arithmetic identifies five types of excep- 
tions that hardware and/or software should be capable of detecting: 


• invalid operations such as 0/0;

• division by zero;

• overflow, when the rounded result exceeds the largest finite number of the
destination format;

• underflow, when the rounded result is smaller than the smallest finite number of
the destination format; and

• inexact result, when the rounded result differs from the infinitely precise result.

The first three conditions will always produce an exception trap on the Alpha, and the
system software environment is supposed to respond appropriately. When a special suf- 
fix (Unix) or qualifier (OpenVMS) is added to the instruction name, a program may 
specify how floating underflow (u) and inexact results (i) are to be handled in hard- 
ware or with software assistance (s). 


The Alpha also provides for signaling of overflow when converting large floating- 
point numbers to integers, by means of a v suffix (Unix) or /v qualifier (OpenVMS) on 
the instruction. 


Specific bits in the floating-point control register (FPCR) turn on when one of 
these six exceptions occurs. RISC systems require extra processor cycles to detect and 
report exceptions synchronously. Therefore it is common to give the programmer (or 
compiler writer) a choice: either rapid calculations ignoring error conditions, or slower 
calculations with tracking of exceptions. We will not discuss arithmetic exceptions or 
the FPCR and its associated instructions in very much detail in this book. 


Function Field in Floating-Point Instructions 


The 11 bits <15:5> comprising the function field in the Alpha floating-point instruc- 
tions are allocated in clusters for encoding both the identity and the manner of operation 
desired for an instruction: 


Bits <15:13>, having to do with over- and underflow, specify trapping modes 
resulting from the optional suffix (Unix) or qualifier (OpenVMS) values ap- 
pended to the instruction name (i, u, v, s). Certain values for this 3-bit field 
are unsupported. The default code 000 specifies “imprecise” handling. 


Bits <12:11>, having to do with rounding, specify rounding modes resulting 
from the optional suffix (Unix) or qualifier (OpenVMS) values appended to 
the instruction name (c, d, m) and additional codes in the floating-point con- 
trol register (FPCR) in the case of d. The default code 10 specifies normal un- 
biased rounding. 

Bits <10:9> specify the data type of the source operand, i.e., S_floating, 
T_floating, or quadword. (One code value is “reserved.”) 


Bits <8:5> pinpoint the particular instruction among the many which share 
the same numeric opcode. (Several of the code values are “reserved.”)


Because of all these encoding requirements, which exceed what was required for the 
integer operate group, the designers of the Alpha floating-point operate group did not 
provide any way to incorporate a “literal” in place of register Fb in such instructions. 
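
The clustering of the function field can be made concrete by taking a function code apart. This is a sketch in Python under stated assumptions: the function name split_function_field is ours, and the meanings attached to the non-default rounding and data-type codes are our reading of the Alpha Architecture Reference Manual, so they should be checked there before being relied upon. Only the default rounding code 10 and the default trapping code 000 are given in the text above.

```python
# Assumed code assignments (see the lead-in); only the defaults come from the text.
ROUNDING = {0b00: "chopped (c)", 0b01: "minus infinity (m)",
            0b10: "normal", 0b11: "dynamic (d)"}
SOURCE = {0b00: "S_floating", 0b01: "reserved",
          0b10: "T_floating", 0b11: "quadword"}


def split_function_field(function):
    """Split an 11-bit function code (instruction bits <15:5>) into its clusters."""
    return {
        "trapping": (function >> 8) & 0b111,           # instruction bits <15:13>
        "rounding": ROUNDING[(function >> 6) & 0b11],  # instruction bits <12:11>
        "source": SOURCE[(function >> 4) & 0b11],      # instruction bits <10:9>
        "operation": function & 0b1111,                # instruction bits <8:5>
    }


for name, code in (("adds", 0x080), ("mult", 0x0A2), ("divt", 0x0A3)):
    print(name, split_function_field(code))
```

Seen this way, adds and addt differ only in the data-type cluster, and the four arithmetic operations are numbered 0 through 3 in the lowest cluster.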


Floating-Point Branch Instructions


The format of the Alpha branch instructions (Figure 4.1) includes the standard 6-bit field 
for the opcode, a 5-bit field for the register name (number) that contains the value to be 
tested, and a 21-bit field for the branch displacement. Six conditional branch instructions 
express various tests based on the value in floating-point register Fa (Table 8.5). 


Table 8.5 Alpha Floating-Point Branch Instructions 








Mnemonic   Opcode   Purpose
fbeq       31       Branch if contents of register Fa is equal to zero
fbne       35       Branch if contents of register Fa is not equal to zero
fblt       32       Branch if contents of register Fa is less than zero
fbge       36       Branch if contents of register Fa is greater than or equal to zero
fble       33       Branch if contents of register Fa is less than or equal to zero
fbgt       37       Branch if contents of register Fa is greater than zero





In all respects the floating-point branch instructions work just like their integer 
counterparts. Recall that the 21-bit displacement field does not explicitly store the low- 
est two bits of the actual address displacement. The CPU hardware multiplies the appar- 
ent displacement encoded in this field by 4 and then adds the resultant scaled value to 
the current value in the program counter (PC). 

Consider the following schematic example containing forward and backward 
branches in a short section of program: 


again:   opA
         fbxx   Fy,onward     ; Conditional forward branch
more:    opB
         br     Rz,again      ; Unconditional backward branch
onward:


Assume for the sake of this present discussion that opA and opB are single instructions, 
though more realistically they could be sequences of many instructions. 


When the fbxx instruction is executing, the program counter has already been 
incremented to point to the instruction at label more. Suppose that the xx condition is 
met and the branch is to be taken. If so, this branch instruction needs to advance the pro- 
gram counter (PC) by 2 longwords (8 address units). In other words, the assembler 
should encode the displacement as 2 (0 0000 0000 0000 0000 0010) in the fbxx 
instruction. During the execution cycle for the fbxx instruction, the number 8 will be
added to the already updated value in the PC. 
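
The displacement arithmetic for both branches in the schematic example can be sketched as follows. The code is Python, not Alpha assembly language; the function name branch_displacement and the addresses 0x1000 through 0x1010 are hypothetical values chosen only to make the example concrete.

```python
def branch_displacement(branch_addr, target_addr):
    """Return the 21-bit displacement field for a branch located at branch_addr."""
    updated_pc = branch_addr + 4          # the PC already points past the branch
    delta = target_addr - updated_pc      # distance in address units (bytes)
    if delta % 4:
        raise ValueError("instructions are longword aligned")
    disp = delta // 4                     # the two lowest address bits are not stored
    if not -(1 << 20) <= disp < (1 << 20):
        raise ValueError("target is out of range for a 21-bit displacement")
    return disp & 0x1FFFFF                # two's complement within 21 bits


# Hypothetical placement of the five labeled locations, one longword apart
again, fbxx, more, br, onward = 0x1000, 0x1004, 0x1008, 0x100C, 0x1010

forward = branch_displacement(fbxx, onward)
backward = branch_displacement(br, again)
print(f"fbxx field: {forward:021b}")
print(f"br   field: {backward:021b}")
```

The forward branch yields the field value 2 derived above. The backward branch must move the updated PC back by four longwords, so its field holds -4 in two's complement form.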

The conditional branch instructions occur in three sets of logical opposites. These 
sets (fbeq/fbne, fblt/fbge, fble/fbgt) express a signed arithmetic comparison
of the value in register Fa to the number zero. For instance if register F2 contains
the number 3.0, the branches fbne, fbge, and fbgt would be taken but the
instructions fbeq, fblt, and fble would fall through.

Both “plus” and “minus” zero are treated as the ideal mathematical zero. A non- 
zero value anywhere within bits <62:0> with sign bit 63 set represents a quantity less 
than zero, while a non-zero value anywhere within bits <62:0> with sign bit 63 clear 
represents a quantity greater than zero. 
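
These bit-level rules can be modeled in a few lines. The sketch below is in Python, not Alpha assembly language; the function names bits_of and fb_taken are our own, and the model works on the 64-bit T_floating pattern exactly as the preceding paragraph describes, using nothing but the sign bit and bits <62:0>.

```python
import struct


def bits_of(x):
    """Return the 64-bit IEEE double (T_floating) pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def fb_taken(condition, x):
    """Decide whether fb<condition> would branch on a register holding x."""
    pattern = bits_of(x)
    sign_set = (pattern >> 63) == 1
    nonzero = (pattern & 0x7FFFFFFFFFFFFFFF) != 0   # anything in bits <62:0>
    negative = nonzero and sign_set
    positive = nonzero and not sign_set
    return {"eq": not nonzero, "ne": nonzero,
            "lt": negative, "ge": not negative,
            "le": not positive, "gt": positive}[condition]


for value in (3.0, -0.0):
    taken = [c for c in ("eq", "ne", "lt", "ge", "le", "gt") if fb_taken(c, value)]
    print(value, "->", taken)
```

For 3.0 the model takes fbne, fbge, and fbgt, as stated in the text. For minus zero it takes fbeq, fbge, and fble, because the sign bit alone does not make a value negative.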


Floating-Point Compare Instructions 


The Alpha provides a set of floating-point compare instructions (Table 8.6) that com- 
plement the fbeq/fbne conditional branch instructions. These instructions share 
opcode 16 with the floating-point arithmetic instructions (see also Table 4.1). 


Table 8.6 Alpha Floating-Point Compare Instructions 


Mnemonic* Opcode Function Code Purpose 
cmpteq      16       A5              Compare T_floating equal
cmptlt      16       A6              Compare T_floating less than
cmptle      16       A7              Compare T_floating less than or equal
cmptun      16       A4              Compare T_floating unordered





*The s suffix (Unix) or /s qualifier (OpenVMS) may be added for software trapping. 


The floating-point compare instructions, like the floating-point arithmetic 
instructions, use registers Fa and Fb as source operands and register Fc as the destina- 
tion operand: 


cmptxx Fa,Fb,Fc      ; Fc <- 2.0 if "Fa xx Fb" is true
                     ; else Fc <- 0.0


The destination register Fc is set to the non-zero value 2.0 (0x4000000000000000) if
the Boolean comparison “Fa xx Fb” is true and is set to 0.0 if that condition is false. 
The designers of the Alpha architecture recognized that not all imaginable com- 
pare instructions are actually needed. Note that “compare less than P,Q” is the same as 
“compare greater than Q,P” and that “compare less than or equal P,Q” is the same as 
“compare greater than or equal Q,P.” As we might expect from the parsimony of a RISC
design, the possibility of redundancy was avoided here for the case of two source oper- 
ands in registers. 

These arithmetic comparisons ignore the sign of floating-point zero; for exam- 
ple, +0.0 tests equal to -0.0. Comparisons involving plus or minus infinity execute 
normally. 

The special “unordered” relation in the cmptun instruction is true if either or both
operands are NaN (not a number).
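
The behavior of the four compare instructions can be summarized in a small model. This is a sketch in Python, not Alpha assembly language; the function name cmpt is ours, and the model deliberately ignores any exception trapping, returning only the value that would be left in register Fc.

```python
import math
import struct


def cmpt(relation, fa, fb):
    """Model cmpteq/cmptlt/cmptle/cmptun: return 2.0 if the relation holds, else 0.0."""
    if relation == "un":
        holds = math.isnan(fa) or math.isnan(fb)
    elif relation == "eq":
        holds = fa == fb
    elif relation == "lt":
        holds = fa < fb
    elif relation == "le":
        holds = fa <= fb
    else:
        raise ValueError("no such compare instruction: cmpt" + relation)
    return 2.0 if holds else 0.0


# The "true" result 2.0 has a single bit set in its T_floating representation.
print(hex(struct.unpack("<Q", struct.pack("<d", 2.0))[0]))

p, q = 1.25, 7.0
print(cmpt("lt", q, p))              # "p greater than q", obtained by swapping operands
print(cmpt("eq", 0.0, -0.0))         # the sign of zero is ignored
print(cmpt("lt", 1.0, math.inf))     # comparisons with infinity execute normally
print(cmpt("un", math.nan, 1.0))     # unordered when an operand is NaN
```

Swapping the two source operands is all that is needed to obtain the "greater than" relations that the instruction set omits.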


Floating-Point Conditional Move Instructions 


The fastest instructions in any computer architecture are those where all necessary oper- 
ands are already contained in registers. Making copies of data in registers is a very com- 
mon requirement in programming algorithms. Conditional branch instructions have 
many applications, but they can retard a program considerably when a sequence is bro- 
ken. Any anticipatory fetching of instructions by the hardware has to be abandoned, and 
fetching of a new instruction sequence must restart at the new address put into the pro- 
gram counter when a branch is taken. 

The Alpha architecture offers a set of conditional move instructions not only for
integers, but also for floating-point quantities. These latter instructions are classified as
floating-point operate instructions with three register operands. 


fcmovxx Fa,Fb,Fc ; Fc <- contents of Fb if "Fa xx 0" true 


Like branch instructions, the contents of register Fa is first tested. The particular logical 
test “xx” is encoded in the function code field in the instruction word. If the test comes 
out true, then the contents of register Fb is copied into register Fc. If the test comes out 
false, register Fc is not modified. 

Conditional move instructions share opcode 17 with certain other sorts of float- 
ing-point instructions, which we take up later in this chapter, that share the property of 
being independent of the floating data type (S_floating, T_floating, or VAX-compati- 
ble). These conditional move instructions correspond to the function codes listed in 
Table 8.7. Because they incorporate a logical test without the potential slowdown
associated with taking a PC-altering branch, these instructions are among the most
powerful in the Alpha instruction set.


Table 8.7 Alpha Floating-Point Conditional Move Instructions


Mnemonic    Opcode   Function Code   Purpose
fcmoveq     17       2A              Conditional move if Fa = 0
fcmovne     17       2B              Conditional move if Fa <> 0
fcmovlt     17       2C              Conditional move if Fa < 0
fcmovge     17       2D              Conditional move if Fa >= 0
fcmovle     17       2E              Conditional move if Fa <= 0
fcmovgt     17       2F              Conditional move if Fa > 0


As with the corresponding integer conditional move instructions, these instruc- 
tions have numerous potential applications. Some examples are the following: 


fcmoveq Fa,Fb,Fc          equivalent to     fbne Fa,skip
                                            addt F31,Fb,Fc


cmptlt  F1,F2,F3          equivalent to     F1 = MAX(F1,F2)
fcmovne F3,F2,F1


fcmovne F31,F31,F31       equivalent to     fnop


Here the first fcmoveq example shortens the program by one instruction, and avoids a
conditional branch. The second example implements a common function ordinarily
found only in high-level languages with a simple sequence of two machine-language
instructions. The cmptlt instruction sets F3=2.0 if F1 is less than F2, or equivalently if
the condition F2>F1 is true; the fcmovne instruction then copies F2 into F1 only if F2 
was indeed greater than F1, since otherwise the previous instruction would have set 
F3=0.0. The third example is an instruction that does nothing, because F31 is perma- 
nently zero and thus cannot ever be unequal to zero. Almost every computer architecture 
has some form of no-operation, no-op, or nop instruction. The Alpha does not allocate a 
unique opcode for nop, since an fcmov instruction (and actually some other special 
cases) can readily provide an equivalent. Thus fnop is an Alpha pseudo-instruction. 
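
The MAX idiom can be traced step by step with a small model. The sketch is in Python, not Alpha assembly language; the function names mirror the two instructions but are our own, each returning the value that the destination register would hold afterward.

```python
def cmptlt(fa, fb):
    """Model cmptlt Fa,Fb,Fc: the returned value is what Fc receives."""
    return 2.0 if fa < fb else 0.0


def fcmovne(fa, fb, fc):
    """Model fcmovne Fa,Fb,Fc: Fc receives Fb if Fa is non-zero, else keeps its value."""
    return fb if fa != 0.0 else fc


def fmax(f1, f2):
    """Two-instruction MAX, with no branch taken on either path."""
    f3 = cmptlt(f1, f2)         # cmptlt  F1,F2,F3
    f1 = fcmovne(f3, f2, f1)    # fcmovne F3,F2,F1
    return f1


print(fmax(1.0, 2.0), fmax(5.0, -3.0), fmax(4.0, 4.0))

# fnop: with F31 (always zero) as the tested register, nothing is ever moved.
print(fcmovne(0.0, 0.0, 0.0))
```

Both paths through fmax execute the same two instructions, which is exactly why the sequence avoids the penalty of a taken branch.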


Copy Sign Instructions 


Floating-point numbers are stored and manipulated as sign and magnitude quantities. 
The Alpha provides special copy sign instructions for selective manipulation of the sign 
bit or the sign bit and the 11-bit exponent field. These instructions (Table 8.8) share 
opcode 17 with several other sorts of floating-point instructions that have the property of 
being independent of the floating data type (S_floating, T_floating, or VAX-compatible). 


Table 8.8 Alpha Floating-Point Copy Sign Instructions





Mnemonic    Opcode   Function Code   Purpose
cpys        17       20              Copy sign
cpysn       17       21              Copy sign negated
cpyse       17       22              Copy sign and exponent


The Alpha copy sign instructions involve three operands, using registers Fa and 
Fb as source operands and register Fc as the destination operand: 


cpys  Fa,Fb,Fc      ; Fc <- sign of Fa, rest from Fb
cpysn Fa,Fb,Fc      ; Fc <- negated sign of Fa, rest from Fb
cpyse Fa,Fb,Fc      ; Fc <- Fa<63:52>, then Fb<51:0>


No checking of operands is performed and no exceptions are signaled. 
These copy sign instructions form the basis for useful floating-point pseudo- 
instructions (e.g., absolute value, negation, copying, or fnop): 


fabs Fx,Fy      is implemented as      cpys  F31,Fx,Fy
fneg Fx,Fy      is implemented as      cpysn Fx,Fx,Fy
fmov Fx,Fy      is implemented as      cpys  Fx,Fx,Fy
fnop            is implemented as      cpys  F31,F31,F31


The cpyse instruction provides a direct means to isolate the sign and exponent (choos- 
ing F31 as Fb) or to rescale a number by any power of two (using an appropriate sign/ 
exponent in Fa). 
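
Because the copy sign instructions are pure bit manipulations, they can be modeled directly on 64-bit integers. The following is a sketch in Python, not Alpha assembly language; the helper names bits and value are ours, and the three instruction models operate on T_floating bit patterns, with F31 represented by the integer zero.

```python
import struct

SIGN = 1 << 63                # bit <63>
SIGN_EXP = 0xFFF << 52        # bits <63:52>: sign and 11-bit exponent
BELOW_SIGN = SIGN - 1         # bits <62:0>
FRACTION = (1 << 52) - 1      # bits <51:0>


def bits(x):
    """T_floating (IEEE double) pattern of x as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def value(pattern):
    """Floating-point value represented by a 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", pattern))[0]


def cpys(fa, fb):
    return (fa & SIGN) | (fb & BELOW_SIGN)


def cpysn(fa, fb):
    return (~fa & SIGN) | (fb & BELOW_SIGN)


def cpyse(fa, fb):
    return (fa & SIGN_EXP) | (fb & FRACTION)


F31 = 0                                        # register F31 always reads as zero
print(value(cpys(F31, bits(-3.5))))            # fabs: absolute value
print(value(cpysn(bits(2.0), bits(2.0))))      # fneg: negation
print(value(cpyse(bits(8.0), bits(1.5))))      # rescale 1.5 by the exponent of 8.0
print(value(cpyse(bits(12.0), F31)))           # isolate sign and exponent of 12.0
```

The last two lines illustrate the two uses of cpyse mentioned above: borrowing the exponent of 8.0 turns 1.5 into 12.0, and clearing the fraction of 12.0 leaves the power of two 8.0.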


Data Conversion Instructions 


Advanced CISC architectures like the VAX have typically supported many data types, 
and have frequently featured special instructions which would convert among some of 
those data representations. Those traditional architectures also usually contain one set 
of “general purpose” registers for integer and floating-point purposes. 


RISC architectures typically have divided register sets. Indeed, it is common (e.g.,
Alpha, MIPS, PowerPC) for there to be no internal datapaths connecting the integer 
registers with the floating-point registers and no instruction types permitting an integer 
register to be a data source and a floating-point register to be a destination, or vice 
versa. Load and store instructions must therefore be involved, overall, in converting 
data initially and afterwards stored in memory locations. 

The norm among RISC architectures is to support some forms of data conversions
using the floating-point registers (and the floating-point load and store instructions) 
almost exclusively. This provides some overall balance in utilization of different parts 
of the CPU. Remember that integer registers are heavily involved in logical manipula- 
tions, subroutine linkage, and other non-arithmetic purposes. 

The data conversion instructions involving the Alpha floating-point registers are 
enumerated in Table 8.9 (omitting those involving VAX-compatible floating-point for- 
mats). These share either opcode 16 or opcode 17 with other instructions of the float- 
ing-point operate group and would, in principle, involve three registers. Since these 
conversions intrinsically involve only two operands, however, the Alpha requires regis- 
ter Fa to be F31 for any of these instructions. 


Table 8.9 Alpha Data Conversion Instructions 


Mnemonic*   Opcode   Function Code   Purpose
cvttq       16       0AF             Convert T_floating to quadword
cvtqs       16       0BC             Convert quadword to S_floating
cvtqt       16       0BE             Convert quadword to T_floating
cvtts       16       0AC             Convert T_floating to S_floating
cvtst       16       2AC             Convert S_floating to T_floating
cvtlq       17       010             Convert longword to quadword
cvtql       17       030             Convert quadword to longword





*Valid combinations of suffixes (Unix) or /qualifiers (OpenVMS) may be added for rounding (c,d,m) and 
trapping (i,s,u). 


Figure 8.2 summarizes the data conversion pathways supported by combinations 
of integer and floating-point instructions on the Alpha. 

Figure 8.2 Data conversion pathways


APPROXPI: Using Floating-Point Instructions 


We now illustrate several of the Alpha floating-point instructions in a program that 
involves a probabilistic, or Monte Carlo, strategy for approaching a computational
application. 

The transcendental number π can be evaluated using many different mathematical
techniques. One method, which we do not hold up as the best by any means, has its
basis in the recognition that the area of a circle of radius a is πa² while the area of a
square just large enough to enclose the circle is 4a² (see Figure 8.3). The ratio of the
area of the circle to the area of the square is πa²/4a² or π/4. Thus if we could
separately measure the areas of the circle and the square, we could estimate π to be 4 times
the ratio of the area of the circle to the area of the square.


Figure 8.3 Circle and enclosing square for estimating pi


Consider a method of mathematical estimation which one might develop by anal- 
ogy to the British pub game of darts. Shoot an arbitrary number of darts at the circle and 
square, not trying to have a good aim but rather trying to shoot in a uniformly unbiased 
manner everywhere within the square. Consider π to be 4 times the total number of hits
within the circle divided by the total number of shots that landed inside the square. 

Because of symmetry, what is true for the whole circle and the whole square is 
also true for the quarter circle and quarter square in the quadrant where x and y are both 
positive. If we could draw uniformly random fractions within [0,1] as x and y, each hit
would satisfy the inequality x² + y² <= 1². Program APPROXPI (Figure 8.4)
implements these ideas in Alpha assembly language.


/* APPROXPI    Monte Carlo approximation of pi (Unix) */

STACK = 1                               # Need to save R26
FRAME = ((STACK*8+8)/16)*16

/* Incorporates an in-line adaptation of RANDFUNC to get
   a randomly generated 64-bit integer which is then made
   into a floating-point value between 0 and 1 (exclusive).
   The random integer is generated by the multiplier/adder
   method. The seed is initialized using the system time.

   Register use:
        R0  - Temporary use
        R1  - Temporary use
        R3  - Number of Monte Carlo shots
        R22 - Multiplier for random number generation
        F0  - 0.5
        F1  - 1.0
        F2  - Temporary use
        F3  - Temporary use
        F4  - Number of hits
        F5  - Number of shots
        R16 - Pointer to control string for printf
        F17 - Computed value to print */

        .data
HALF:   .t_floating 0.5e0               # Floating-point 0.5
TOPS:   .quad   1048576                 # Number of shots to make
SEED:   .quad   0                       # Seed for random algorithm
MULT:   .quad   3141592653589793221     # Multiplier
TEN6:   .quad   1000000                 # 1,000,000 (one million)
TIME:                                   # To be accessed as quad!
SECS:   .long   0                       # Seconds
USEC:   .long   0                       # Microseconds
VOID:   .long   0                       # (info not needed)
        .long   0                       # (info not needed)
PRNT:   .asciiz "%f\n"                  # Format for printf

        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #   mark the mandatory
main:                                   #   'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        .mask   0x04000000,-FRAME       # Saved only register R26
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  first
first:  lda     $16,TIME                # Address of TIME structure
        lda     $17,VOID                # Address of VOID structure
        jsr     $26,gettimeofday        # System time
        ldgp    $gp,0($26)              # Restore global pointer
        ldq     $0,TIME                 # Two unsigned longwords
        zap     $0,0xf0,$1              # R1 = SEC
        srl     $0,32,$0                # R0 = USEC
        ldq     $2,TEN6                 # R2 = 1,000,000
        mulq    $1,$2,$1                # Seconds * one million
        addq    $1,$0,$1                # Add the odd microseconds
        ldt     $f0,HALF                # Floating-point 0.5
        addt    $f0,$f0,$f1             # Floating-point 1.0
        cpys    $f31,$f31,$f4           # Counter for hits
        cpys    $f31,$f31,$f5           # Counter for shots
        ldq     $3,TOPS                 # Number of shots to do
        ldq     $22,MULT                # Get multiplier
loop:   mulq    $1,$22,$1               # R1 = SEED * MULT
        addq    $1,1,$1                 # Add one
        stq     $1,SEED                 # 64 random bits
        ldt     $f2,SEED                # Move into floating register
        cpyse   $f0,$f2,$f2             # Scale to range 0.5 to 1
        subt    $f2,$f0,$f2             # Scale to range 0 to 0.5
        addt    $f2,$f2,$f2             # Scale to range 0 to 1
        mult    $f2,$f2,$f2             # x * x
        mulq    $1,$22,$1               # R1 = SEED * MULT
        addq    $1,1,$1                 # Add one
        stq     $1,SEED                 # 64 random bits
        ldt     $f3,SEED                # Move into floating register
        cpyse   $f0,$f3,$f3             # Scale to range 0.5 to 1
        subt    $f3,$f0,$f3             # Scale to range 0 to 0.5
        addt    $f3,$f3,$f3             # Scale to range 0 to 1
        mult    $f3,$f3,$f3             # y * y
        addt    $f2,$f3,$f2             # x * x + y * y
        cmptle  $f2,$f1,$f2             # If outside the circle
        fbeq    $f2,skip                #   then it is not a hit
        addt    $f4,$f1,$f4             # Count a hit
skip:   addt    $f5,$f1,$f5             # Count the shot
        subq    $3,1,$3                 # Count down...
        bgt     $3,loop                 # ...until done
        divt    $f4,$f5,$f5             # Find hits/shots
        addt    $f5,$f5,$f5             # Multiply by 4.0 to get
        addt    $f5,$f5,$f17            #   pi = 4 * hits/shots
        lda     $16,PRNT                # Format for printf
        jsr     $26,printf
        ldgp    $gp,0($26)              # Restore global pointer
done:   mov     0,$0                    # Signal all is normal
        ldq     $26,($sp)               # Restore exit address
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure

Figure 8.4 APPROXPI: Illustrating the use of floating-point instructions 


We have already presented an algorithm for producing sequences of pseudo-random 
64-bit integers (RANDFUNC, Figure 7.7). We have pruned away all extraneous aspects 
of argument-passing and have re-engineered that algorithm as part of APPROXPI. We 
elected to use a cpyse instruction to overwrite the 12 most significant bits of each gen- 
erated 64-bit pseudo-random integer with the sign and exponent of 0.5. The remaining 
52 generated bits with the hidden bit thus complete a T_floating number within [0.5,1).
This has to be subtractively offset or shifted by 0.5 to be within [0,0.5) and then
multiplicatively scaled by doubling to be within [0,1). We go through these steps twice, the
first time to pick a pseudo-random x and the second time to pick a pseudo-random y for 
each shot. 
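
The generation and scaling steps can be followed one at a time in a high-level model. The sketch below is in Python, not Alpha assembly language; the function names next_seed and to_unit_interval are ours, the multiplier is the value of the MULT data item in APPROXPI, and the starting seed is an arbitrary stand-in for the system time.

```python
import struct

MULT = 3141592653589793221       # multiplier, as in the MULT data item
MASK64 = (1 << 64) - 1           # mulq and addq keep only the low 64 bits
HALF_SIGN_EXP = 0x3FE << 52      # sign and exponent bits <63:52> of 0.5
FRACTION = (1 << 52) - 1         # bits <51:0>


def next_seed(seed):
    """One step of the multiplier/adder method: seed * MULT + 1, modulo 2**64."""
    return (seed * MULT + 1) & MASK64


def to_unit_interval(random_bits):
    """Turn 64 random bits into a T_floating value in the range [0, 1)."""
    patched = HALF_SIGN_EXP | (random_bits & FRACTION)              # cpyse: [0.5, 1)
    scaled = struct.unpack("<d", struct.pack("<Q", patched))[0]
    offset = scaled - 0.5                                           # subt: [0, 0.5)
    return offset + offset                                          # addt: [0, 1)


seed = 123456789                 # arbitrary stand-in for the time-derived seed
for _ in range(5):
    seed = next_seed(seed)
    print(to_unit_interval(seed))
```

Only the low 52 bits of each generated integer survive the cpyse step, so the smallest possible result is exactly 0.0 and the largest falls just short of 1.0.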

Each such point (x,y) is tested to determine whether it lies within the quarter circle 
(a hit) or only within the quarter square (a shot). APPROXPI contains the parameter 
TOPS set for a suggested total of about a million shots. The estimate for π is computed
directly in register F17 in anticipation of calling a print routine. 

The Unix version of APPROXPI (Figure 8.4) calls the printf function in the C 
support library. A first argument should always be the address of the null-terminated 
control string that specifies how to format the line of printed output. Any second or sub- 
sequent arguments relating to numeric quantities should be passed by value in R17—R21 
or in F17—F21, as appropriate for each integer or floating-point quantity, continuing 
with values on the stack if necessary. The OpenVMS version of APPROXPI (on the 
CD-ROM that accompanies this book) similarly uses the $call macro to call 
decc$txprintf, the appropriate variant C support library routine when one or more 
T_floating quantities are among the items to be printed. 

Convergence of this algorithm is quite slow. Ten runs of this program yielded a 
mean of 3.141 (standard deviation 0.002). If we assume that Poisson statistics apply
to this type of calculation, the standard deviation of a count of N shots is the square
root of N, so the relative standard deviation is 1/sqrt(N). Thus we would expect a standard
deviation of 0.1% from a million shots, and that is roughly borne out by the one
“experiment” just cited. 
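
A similar experiment is easy to repeat in a high-level language. The following sketch is in Python, not Alpha assembly language, and differs from APPROXPI in ways we should state plainly: it uses Python's own random number generator rather than the multiplier/adder method, a fixed seed of our choosing so that the run is repeatable, and only 65,536 shots per run to keep the running time short.

```python
import math
import random


def estimate_pi(shots, rng):
    """Estimate pi as 4 * hits / shots using points in the unit quarter square."""
    hits = 0
    for _ in range(shots):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:        # inside the quarter circle: a hit
            hits += 1
    return 4.0 * hits / shots


rng = random.Random(1999)               # arbitrary fixed seed (an assumption of ours)
shots = 1 << 16
runs = [estimate_pi(shots, rng) for _ in range(10)]

mean = sum(runs) / len(runs)
spread = math.sqrt(sum((r - mean) ** 2 for r in runs) / (len(runs) - 1))
poisson_guess = math.pi / math.sqrt(shots)     # relative deviation 1/sqrt(N), times pi

print(f"mean of 10 runs      {mean:.4f}")
print(f"standard deviation   {spread:.4f}")
print(f"rough 1/sqrt(N) rule {poisson_guess:.4f}")
```

The observed spread should be of the same order as the rough rule, and quadrupling the number of shots should only halve it, which is the slow convergence noted above.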

In spite of the slow convergence, similar Monte Carlo calculations are actually 
used for some problems in science and engineering, such as numerical integration in 
many dimensions. Note that our example of the circle and square is tantamount to 
“computing” an area, i.e., estimating the value of an integral (in two dimensions) that 
measures the area bounded by the equation of the circle and the orthogonal axes. 


Summary 


Computations for applications in science and engineering require the flexibility of scal- 
ing that floating-point representations provide. Computer systems and high-perfor- 
mance workstations marketed, in part, for such applications have usually incorporated 
support for floating-point operations as part of overall instruction set design. 

RISC designs, such as the Alpha, typically depart from the CISC tradition of one 
set of universal registers and instead provide a dedicated set of floating-point registers 
as part of a major section of the CPU devoted to floating-point arithmetic and related 
data conversions. Such partitioning of the CPU into physically separate floating-point 
and integer units can lead to performance gains of at least two sorts. First, the separate 
units can be designed with different depths of pipelining appropriate to the operations 
carried out by each. Second, the strict “division of labor” may make possible the simul- 
taneous execution of one floating-point instruction in one unit of the CPU and one inte- 
ger instruction in another unit. 

The Alpha architecture extends this duality to the provision of a separate set of
branching instructions, based upon values in floating-point registers, that are executed 
in the floating-point section of the CPU in a manner not unlike the operation of the set 
of branching instructions, based upon values in integer registers, that are executed in the 
integer/Boolean section of the CPU. Other design choices are possible. The PowerPC 
architecture has a universal set of branch instructions that test “condition codes” set by 
a previous floating-point or integer calculation, while the MIPS architecture has a dif- 
ferent branching scheme somewhat intermediate in character between those of the 
Alpha and the PowerPC. 

In this chapter we have covered the encoding and basic operation of the Alpha 
instructions which involve the floating-point register set. We concluded the chapter with 
a programming example that used several of those types of instructions. 


REFERENCES 


Alpha Architecture Committee, Alpha Architecture Reference Manual, 3rd ed. Newton, Mass.: Butter- 
worth-Heinemann (Digital Press), 1998. 

Heath, Steve, PowerPC: A Practical Companion. Oxford, UK: Butterworth-Heinemann Ltd., 1994. 

IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754-1985. New York: Institute of
Electrical and Electronics Engineers, 1985. 

Patterson, David A. and John L. Hennessy, Computer Organization and Design: The Hardware/Software 
Interface, 2nd ed. San Francisco, Cal.: Morgan Kaufmann Publishers, Inc., 1998. 

Stallings, William, Computer Organization and Architecture: Designing for Performance, 4th ed. Upper 
Saddle River, N.J.: Prentice Hall, Inc., 1996. 

Wilkins, Charles L., Charles E. Klopfenstein, Thomas L. Isenhour, Peter C. Jurs and, for BASIC programs
from chemistry (Part II), James S. Evans and Robert C. Williams, Introduction to Computer
Programming for Chemists—BASIC Version. Boston, Mass.: Allyn and Bacon, Inc., 1974.


EXERCISES 


8.1 Discuss why a computer architecture may exhibit a considerable degree of similarity 
between its integer instruction set and its floating-point instruction set. 


8.2 In a tag sort, the final loop that does actual data movement is driven from a list of 
addresses for the desired order of the sorted list of elements. A temporary storage 
element is always required to “interchange any A with any B.” Show that a floating- 
point register could be used for this purpose even if the data elements to be inter- 
changed are actually integers. Can you think of circumstances when a programmer 
(or compiler) might make this choice of register instead of an integer register? 


8.3 When two floating-point quantities are to be added, their binary points must be
aligned by shifting bits in the fraction and simultaneously incrementing or
decrementing the exponent field of one of them. Explain whether a more accurate result
should be obtained by adjusting the smaller or the larger number.


8.4 The integers that a computer can represent using 64 bits are evenly spaced. The
floating-point numbers that a computer can represent using 64 bits are not all evenly
spaced apart. Explain.


8.5 The signum function, SGN(x), is defined as +1, 0, -1 according to whether the
argument is positive, zero, or negative. Write what you believe to be the shortest possible
implementation of an in-line instruction sequence computing the signum function of
data type T_floating for an argument of type T_floating (argument in register Fx,
result in register F0).


8.6 What form of copy sign instruction can be used to “move” the quantity zero into any
floating-point register?


8.7 Rewrite DECNUM2 to use data conversion instructions and floating-point division
by 10. What is the largest magnitude of integer (expressed as a power of 2) which can
be treated exactly by this method?


8.8 Modify APPROXPI to print the mean and variance from N=10 replications for a
given value of TOPS. Compute the standard deviation manually, from the variance.
Explore resetting different values of TOPS to determine whether the standard
deviation and/or the discrepancy between π and the mean scale in proportion to the square
root of TOPS (i.e., draw a graph plotting those results vertically versus the square
root of TOPS horizontally).


8.9 Write an assembly language function and test it from a main program written in a
high-level language of your choice. Pass arguments to the function by reference, and
pass a result back in register R0 or F0 as appropriate. Select one of the following:


a. Function LOW(List,N) that finds the lowest value in a list of N signed
floating-point numbers.


b. Function POS_LOW(List,N) that returns the index location (counting from zero)
where the lowest value occurs in a list of N signed floating-point numbers.


8.10 (Strongly recommended) Incorporate all of the additional Alpha instructions from
this chapter into your personal summary sheet(s).


CHAPTER 9 


Conditional Assembly 
and Macros 


Any assembler has the paramount task of converting 
mnemonic instructions and symbolic expressions into valid instructions with internally 
consistent addresses for all references to data, as we have discussed in Chapter 3 and 
elsewhere. In addition, some assemblers provide numerous features intended to lessen 
the tedium of producing repetitious patterns of data definitions or instructions, as well 
as to reduce the likelihood of erroneous programming. In this chapter, we take up the 
topics of repeat blocks, conditional assembly, macros, and the related specific assem- 
bler directives for the MACRO-64 assembler that provide the programmer with some 
convenient extensions of the minimal assembly language for the OpenVMS program- 
ming environment. 

By contrast, the Unix assembler provided by Digital Equipment Corporation con- 
tains very few capabilities analogous to the material presented in this chapter. 


Those of our readers who are primarily interested in the machine design aspects of 
computer architecture may wish to skim this chapter, since little of the material here 
relates directly to architectural aspects of the Alpha in particular or computers in gen- 
eral. Those of our readers who are more interested in learning how system software is 
developed, or in gaining a more comprehensive understanding of programming envi- 
ronments accessible to system developers and compiler writers, may wish to devote 
closer attention to this chapter, which rounds out the description of MACRO-64 as a 
computer language. 


Readers who already know PDP-11 MACRO or VAX MACRO should have an 
experience of déjà vu here, since this family heritage has strongly influenced the syntax
and features of MACRO-64 for the Alpha. Some directives with new style, spelling, or
semantics are also accepted in the corresponding VAX format, but in this chapter we
intend to show only the new formats introduced with MACRO-64. (In order to assure
the longest future life for a piece of software, a programmer is always well advised to 
heed the warning signals from vendors, in the case of assembly language, or the various 
standards-setting boards, in the case of high-level languages: avoid perpetuating the use 
of “declining features” and adopt new or recommended features instead.) 


Repeat Blocks 


MACRO-64 provides two varieties of repeat blocks, those with an explicitly specified 
repetition count and those which are called indefinite repeat blocks. These capabilities 
can assist with the formulation of highly repetitive code segments. 


Simple Repeat Blocks 


The astute reader may already have speculated whether an assembler would provide any 
aids for continuing the repeating pattern of instructions required by the algorithm for 
the SQUARES program presented in Chapter 1. If we wanted to extend the program 
toward some much larger total number of squares, would we really have to “copy and 
paste” with the text editor ad nauseam, or is there some alternative to that? 

A repeat block indeed provides an alternative. A sequence of instructions, and 
other assembler directives if desired, can be repeated by surrounding them with 
.repeat and .endr directives: 


        .repeat HowManyTimes

        <sequence of instructions and directives>

        .endr

The parameter denoted here as HowManyTimes may be a constant, a previously
defined symbolic value, or an expression that MACRO-64 can evaluate explicitly. If the
value is less than or equal to zero, the repeat range is not assembled at all. If the value is 
greater than zero, the sequence is repeated the specified number of times. 
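
The rule for the repetition count can be modeled outside the assembler. The following is a minimal Python sketch, not MACRO-64 itself; the function name expand_repeat is hypothetical, and the count is assumed to have been evaluated already.

```python
def expand_repeat(count, body):
    """Return the lines a simple repeat block would contribute.

    count -- the already-evaluated repetition count (hypothetical input)
    body  -- list of source lines between .repeat and .endr
    """
    if count <= 0:
        # Zero or negative: the repeat range is not assembled at all.
        return []
    # Positive: the whole body is replicated, unchanged, count times.
    return list(body) * count


body = ["addq R2,R1,R1", "addq R1,R0,R0"]
expanded = expand_repeat(3, body)
print(len(expanded))  # prints 6: the two-line body, three times
```

Every copy is identical, which is exactly the limitation of the simple repeat block discussed in the text that follows.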


In Figure 1.2, a sequence of three Alpha instructions must be repeated for each of 
the N squares to be calculated: 


        addq    R2,R1,R1        ; Adjust first difference
        addq    R1,R0,R0        ; R0 = nth square
        stq     R0,SQn          ; to be stored

But if we were to replicate the same symbolic location SQn for the store instruction, we
would not produce a list of squares stored in successive quadwords; each computed 
value would overwrite the previous value. The simple .repeat directive lacks any 
capability to vary the symbolic location SQn. A programmer or compiler writer typi- 
cally provides only one symbolic name for an entire vector, not a symbolic name for 
each element. With that clue, we might set up an additional register as a storage pointer 
and write a repeat block like this: 


        N = <whatever we want>

        lda     R3,SQ           ; R3 -> storage area
        .repeat N
        addq    R2,R1,R1        ; Adjust first difference
        addq    R1,R0,R0        ; R0 = nth square
        stq     R0,(R3)         ; to be stored
        lda     R3,8(R3)        ; Advance the data pointer
        .endr


Having accomplished this rather modest elaboration in the code section, and a corre- 
sponding relabeling in the data section, we should have little reluctance to write a loop- 
less program like SQUARES for a much larger N than the previous single-digit value. 

Simple repeat blocks have the limitation of no variability at all. The other types of 
repeat blocks and macros are more versatile, though sometimes harder to devise and 
perfect. Simple repeat blocks can, however, contain most of the other constructs in this 
chapter and gain versatility that way. 


Indefinite Repeat Blocks Using the .irp Directive 


As we have just seen, simple repeat blocks merely produce some N identical replica- 
tions of a sequence. Greater power or usefulness would seem to require a capability to 
vary some crucial detail within the sequence. An indefinite repeat block fills this need, 
using the .irp directive in the following way:


        .irp    symbol,<parameter list>
        <sequence containing 'symbol'>
        .endr

where symbol is a formal parameter, i.e., just a pro forma symbolic place-holder. This
formal parameter takes on each successive actual parameter value enumerated in the
list, i.e., a different value during each repetition through the sequence. These parameter
values, which are separated by commas and enclosed within a set of angle brackets,
comprise a set over which the formal parameter can vary. If the list is empty, no expansion
occurs.

The symbol for an indefinite repeat block may appear any number of times or in 
any field of instructions (or directives). Each occurrence is subjected to the same text 
substitution. The process resembles the find and replace operation within a word pro- 
cessor. 

A reprise of the DOT_3 program (Figure 4.4) can serve to illustrate how .irp 
works. You will recall that the symbolic representations of the vector components 
(X,Y,Z) appear in the code section of DOT_3 in a completely symmetrical way. Such 
symmetry stands out as a tip-off that a repeat block introduced by .irp could be 
devised as follows: 


        .irp    COMP,<X,Y,Z>
        ldq     R1,COMP(R14)    ; R1 = COMP component of V1
        ldq     R2,COMP(R15)    ; R2 = COMP component of V2
        mulq    R1,R2,R1        ; R1 = COMP*COMP now
        addq    R1,R0,R0        ; Update the sum
        .endr


The expansion will reproduce our original central portion of DOT_3. (Readers with
access to a system with MACRO-64 should verify this for themselves.)

Such a use of .irp can make the .M64 source file clearer because our attention is
drawn to a general pattern, not to a plethora of particular instances. The appearance of 
the corresponding listing file may seem clumsy, however; it is very difficult to preserve 
flawless spacing and tabbing in the expansions of repeat blocks and macros. 
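
Readers without access to MACRO-64 can still check the text substitution performed by an indefinite repeat block. The following Python sketch models it for the DOT_3 sequence; it is only a model, the name expand_irp is hypothetical, and it assumes that the formal parameter is matched as a whole word.

```python
import re


def expand_irp(symbol, values, body):
    """Return one copy of body per value, with symbol replaced by that value."""
    # Assumption: the formal parameter is matched on word boundaries.
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    expanded = []
    for value in values:
        for line in body:
            # Substitution applies to every field, comments included.
            expanded.append(pattern.sub(lambda match: value, line))
    return expanded


body = [
    "ldq R1,COMP(R14) ; R1 = COMP component of V1",
    "ldq R2,COMP(R15) ; R2 = COMP component of V2",
    "mulq R1,R2,R1 ; R1 = COMP*COMP now",
    "addq R1,R0,R0 ; Update the sum",
]
for line in expand_irp("COMP", ["X", "Y", "Z"], body):
    print(line)
```

The output consists of twelve lines, four for each of X, Y, and Z. An empty list of values produces no lines at all, matching the rule that an empty parameter list causes no expansion.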


Indefinite Repeat Blocks Using the .irpc Directive 


Another variety of indefinite repeat block uses the .irpc directive and, again, the 
.endr directive: 


        .irpc   symbol,<string>
        <sequence containing 'symbol'>
        .endr

For each repetition of the body of the repeat block, symbol takes on the value of the
next character in the string. Hence the length of the string determines the number of
repetitions (none at all if the string is null).

There is no fundamental difference between the .irp and .irpc directives, but
the latter obviates the need for a lot of commas and ^a"x" constructs when only single
characters are to be specified for substitution. Moreover, the string for use with .irpc
can contain punctuation characters as the valid parameter values, e.g., <+-*/> or <.;:?>.

In order to illustrate .irpc very simply, let us imagine the need to test whether 
the scratch copy of an ASCII code in register RO specifies a vowel. If so, register R1 is 
to be set to 1; if not, then register R1 is to be set to 0. Since there is no formulaic rela- 
tionship connecting the ASCII codes for the vowels, we have little choice but to devise 
an enumerated case structure: 


        bic     R0,^x20,R0      ; Force to upper case
        .irpc   CHAR,<AEIOU>    ; Look for vowels
        cmpeq   R0,^a"CHAR",R1  ; Is it CHAR?
        bne     R1,ahead        ; Yes
        .endr

ahead:


The expansion will produce five two-instruction subsequences: 


        cmpeq   R0,^a"A",R1     ; Is it A?
        bne     R1,ahead        ; Yes
        cmpeq   R0,^a"E",R1     ; Is it E?
        bne     R1,ahead        ; Yes
        cmpeq   R0,^a"I",R1     ; Is it I?
        bne     R1,ahead        ; Yes
        cmpeq   R0,^a"O",R1     ; Is it O?
        bne     R1,ahead        ; Yes
        cmpeq   R0,^a"U",R1     ; Is it U?
        bne     R1,ahead        ; Yes


Note that the textual substitution of each successive character occurs within both the 
operand field and the comment field. 
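
What the expanded sequence computes at run time can be traced in a high-level language. The following Python sketch mirrors the five compare-and-branch pairs; the function name is_vowel_flag is hypothetical, and the local variables r0 and r1 merely stand in for registers R0 and R1.

```python
def is_vowel_flag(code):
    """Return the final value of R1 for the ASCII code held in R0."""
    r0 = code & ~0x20  # bic R0,^x20,R0 : clear bit 5 to force upper case
    r1 = 0
    for ch in "AEIOU":  # one cmpeq/bne pair per character of the string
        r1 = 1 if r0 == ord(ch) else 0  # cmpeq R0,^a"ch",R1
        if r1 != 0:  # bne R1,ahead : leave at the first match
            break
    return r1  # value of R1 on arrival at label ahead


print(is_vowel_flag(ord("e")))  # prints 1
print(is_vowel_flag(ord("z")))  # prints 0
```

If no comparison succeeds, control falls through the last pair with R1 equal to 0, which is the desired result for a non-vowel.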

Both .irp and .irpc, like .repeat, can also contain macro definitions. Conversely,
a macro definition can contain any combination of these types of repeat blocks.


Conditional Assembly 


Many programming environments, including some high-level languages, provide for 
conditional assembly or conditional compilation based upon situation-dependent 
parameters. The motivations for such a capability, and some of the situations where it is 
useful, have historically included:

• bracketing segments of a program initially written as aids during development and
  debugging, which may contain useful internal documentation in the comments but
  which are not needed in the production version;

• producing simplified versions of a product; examples have included subset versions
  of the RT-11 and RSX families of PDP-11 operating systems;

• accommodating the presence or absence of particular hardware features;

• tailoring a general library macro to produce appropriate code for a specific situation.


Deriving such alternative versions of software from a common source file contributes to 
consistency, long-term maintainability, and perhaps even provability of correctness 
when meticulously done. 

MACRO-64 provides two levels of conditional assembly. These are a traditional IF- 
block style of conditional assembly, discussed in this section, and a lexical preprocessing 
capability similar to that of certain high-level languages, not discussed in this book. 


Conditional Assembly Blocks Using the .if Directive 


A conditional assembly block is bounded by a .if assembler directive at the top and a
.endc assembler directive at the bottom:


        .if     condition test argument(s)

        range of lines containing assembly language statements or assembler
        directives

        .endc

where condition test specifies the circumstances under which the range will be
assembled based on a test of the arguments. The condition test must be separated
from the argument(s) by a comma, space(s), or tab(s). Table 9.1 lists the conditional
tests provided by MACRO-64. These are typical of the features of assemblers
in general, though the syntax will differ for the programming environments for other
architectures.


Table 9.1 Condition Tests for the .if and .iif Conditional Assembly Directives

Condition Test
Long Form        Short Form   Argument Type    Number of Arguments   Condition that Assembles the Block
EQUAL            EQ           Expression       1 or 2*               Expression-1 is equal to expression-2.
NOT_EQUAL        NE           Expression       1 or 2*               Expression-1 is not equal to expression-2.
GREATER          GT           Expression       1 or 2*               Expression-1 is greater than expression-2.
LESS_EQUAL       LE           Expression       1 or 2*               Expression-1 is less than or equal to expression-2.
LESS_THAN        LT           Expression       1 or 2*               Expression-1 is less than expression-2.
GREATER_EQUAL    GE           Expression       1 or 2*               Expression-1 is greater than or equal to expression-2.
DEFINED          DF           Symbolic         1                     Symbol is defined.
NOT_DEFINED      NDF          Symbolic         1                     Symbol is not defined.
BLANK            B            Macro argument   1                     Argument is blank.
NOT_BLANK        NB           Macro argument   1                     Argument is not blank.
IDENTICAL        IDN          Macro argument   2†                    Arguments are identical.
DIFFERENT        DIF          Macro argument   2†                    Arguments are different.

* If the second argument is omitted, the comparison is made against 0.

† Lower case string arguments are converted to upper case before being compared unless the string is
surrounded by double quotes.


The arguments to be tested are symbolic arguments or expressions that can be 
completely assessed because everything within the expression has been previously 
defined. If there are two arguments, they must be separated by a comma. 

The range of lines bracketed by .if and .endc either will be completely considered
for assembly (and for interpretation of any nested conditionals or macros) or else
will be entirely omitted from consideration. When a conditional range is not consid- 
ered, any new symbols introduced within that range will not become defined. 

Conditional blocks may be nested to a depth of 100. If an outer condition is not 
satisfied, the inner conditionals will not be considered for assembly. The .if and 
.endc directives must be strictly matched. Consider the following example: 


        .if     defined SYMBOL1
        Outer range (A)

        .if     defined SYMBOL2
        Inner range (B)
        .endc

        Outer range (C)
        .endc


The following table summarizes which of the three ranges will be assembled, depending
on whether SYMBOL1 and/or SYMBOL2 has been previously defined:


SYMBOL1      SYMBOL2      What is assembled
Undefined    Undefined    Nothing
Defined      Undefined    A, C
Undefined    Defined      Nothing
Defined      Defined      A, B, C


Notice that there is no circumstance in which range B would be assembled but ranges A 
and C would not be assembled. If such a result were desired, it could be achieved by 
inserting one or another of the additional directives described below. 
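
The four cases for the nested example can be reproduced with ordinary nested IF statements. The following is a minimal Python sketch of the nesting logic only; the function name assembled_ranges and its Boolean arguments are hypothetical stand-ins for "SYMBOL1 is defined" and "SYMBOL2 is defined."

```python
def assembled_ranges(symbol1_defined, symbol2_defined):
    """List which of ranges A, B, C would be assembled."""
    ranges = []
    if symbol1_defined:          # outer .if defined SYMBOL1
        ranges.append("A")
        if symbol2_defined:      # inner .if defined SYMBOL2
            ranges.append("B")
        ranges.append("C")
    return ranges                # nothing at all if the outer test fails


for s1 in (False, True):
    for s2 in (False, True):
        print(s1, s2, assembled_ranges(s1, s2))
```

Because the inner test is reached only when the outer test succeeds, range B can never appear without ranges A and C.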


Single Conditional Alternative Using .else Directive 


A standard .if conditional block is parallel to an IF...THEN...ENDIF construct in a 
high-level language. MACRO-64 also provides the .else directive in order to support
the equivalent of an IF...THEN...ELSE...ENDIF construct in a high-level language. 

The .else directive partitions the body of the conditional block into a “THEN” 
portion lying between the .if and .else directives and an “ELSE” portion lying 
between the .else and .endc directives. If the condition test is met, all the lines 
between .if and .else are assembled. If the condition test is not met, all the lines 
between .else and .endc are assembled. For example: 


        .if     greater P,Q
        Assemble line(s) here if P > Q; "THEN" portion
        .else
        Assemble line(s) here if P <= Q; "ELSE" portion
        .endc


If it were necessary to handle the situation P=Q differently from the situation P<Q, a 
second conditional block could be nested inside the “ELSE” portion.

The .else directive, which can legitimately occur only inside a conditional
block bounded by .if and .endc directives, resembles the .if_false directive. A
conditional block can contain only a single instance of .else, but .if_false can be
used multiple times within the span of one conditional block (discussed next).


Alternating the Conditional Assembly Using True-False Sense 
Switching 


Some programming applications have situations in their preliminary, main, and con- 
cluding phases that may depend on the same condition, such as P>Q. If the conditional 
test is simple, we would usually repeat the test in each phase when programming in a 
high-level language. If the conditional test is more complex, we might instead define a 
Boolean variable that represents the sense of that condition. Then the repetitious condi- 
tional tests would simply test that Boolean variable for truth/falsity. 

At the assembly language level, such situations occur somewhat more frequently. 
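For compatibility with VAX MACRO, the MACRO-64 assembler provides three “subconditional”
directives (.if_false, .if_true, and .if_true_false) which, like .else, can
legitimately occur only inside a conditional block bounded by .if and .endc directives.
In the following example, the comments show what happens when SYM is defined: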


        .if     defined SYM
        Range A                 ; Assembled
        .if_false
        Range B                 ; Not assembled
        .if_true
        Range C                 ; Assembled
        .if_true_false
        Range D                 ; Assembled
        .endc


Suppose instead that SYM is not defined when this .if block is encountered. Then 
ranges A and C will not be assembled, but ranges B and D will be assembled. 
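
Both outcomes of the example can be summarized by attaching a rule to each range. The following Python sketch is a model of this one example under that reading, not of the assembler in general; the name assembled and the table layout are hypothetical.

```python
def assembled(condition):
    """Return the ranges assembled, given the truth of the .if condition."""
    layout = [
        ("A", lambda c: c),       # directly under .if: same sense as the test
        ("B", lambda c: not c),   # under .if_false: opposite sense
        ("C", lambda c: c),       # under .if_true: same sense again
        ("D", lambda c: True),    # under .if_true_false: either sense
    ]
    return [name for name, rule in layout if rule(condition)]


print(assembled(True))    # SYM defined:     ['A', 'C', 'D']
print(assembled(False))   # SYM not defined: ['B', 'D']
```

The original condition is evaluated only once, at the .if directive; each subconditional directive merely selects which sense of that stored result governs the lines that follow it.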

A conditional assembly block may contain either one .else directive or arbitrarily
many occurrences of the .if_x directives, but not both.


One-Line Conditional Statements Using the .iif Directive 


A final type of conditional block is provided for those instances where the body of the 
block would be a single assembler statement. The .iif directive, called the “immediate”
IF conditional block, does not require any matching .endc directive. This directive
takes the form:


        .iif    condition argument(s), statement


where the rules for conditions and arguments are the same as for the .if directive. The
condition must be separated from the argument(s) by a comma, space(s), or tab(s). The 
arguments must be separated from the statement by a comma. Examples look like this: 


        .iif    defined NEGATE, subq R31,R2,R2  ; Negate
        .iif    greater B,5, addq R2,1,R2       ; Increment


The subq instruction is assembled if and only if the symbol NEGATE has been previously
defined. The addq instruction is assembled if and only if the previously defined
symbol B has a value that is greater than 5. (If symbol B has not been defined, the 
assembler will generate an error condition.) 


Macro Processing 


The principles of good programming design and the motivation to enhance programmer 
productivity and accuracy emphasize such concepts as abstraction, encapsulation, and 
controlled adaptation. Facilities accessible to the assembly language programmer 
include subroutines, procedures, repeat blocks, macros, and lexical preprocessing. 

In Chapter 7, we studied the design and uses of subroutines and procedures at the 
assembly language level. Only one copy of such a routine exists in the linked execut- 
able program. Control passes to the routine whenever it is called, and it produces effects 
that can be mediated by the particular parameters passed. Because these routines con- 
form to a prescribed calling standard and exist in only one copy, subroutines and proce- 
dures are sometimes called closed routines (see Levy). 

Earlier in this chapter, we also studied the design and uses of repeat blocks which 
merely replicate code segments with, at most, one substitutable parameter (using .irp
or .irpc).

Macros offer capabilities somewhat midway between those of subroutines and 
those of repeat blocks. Like subroutines, macros may have numerous parameters. Like 
repeat blocks, they insert explicit code expansion in-line. Because macros expand into 
explicit machine instructions at the place of invocation and require no linkage mecha- 
nism to pass values or control, macros are sometimes called open routines (see Levy). 

Macros can be very simple, consisting of just a few lines, or they can be highly 
complex, encompassing dozens of parameters and hundreds of lines with nesting. The 
possible applications of macros are very extensive, including the following purposes: 


• to make the program code more readable. Note that this is the same motivation as
  for some of the proceduralization in high-level languages, e.g., Pascal;

• to implement a “missing instruction” through emulation in an appropriate and
  potentially conditional sequence of actual machine instructions supported by the
  architecture;

• to define an emulation of another machine’s instructions (this may be rather difficult);
  and

• to create data structures and initialize them statically.


We will illustrate some of these and other applications throughout the text portion of 
this chapter and its exercises. 


Defining a Macro 


A macro must be defined before it can be used. A macro definition begins with the 
.macro directive and ends with the .endm directive. All intervening lines comprise
the body of the macro, which may contain arbitrarily many occurrences of formal 
parameters that are specified with the .macro directive, like this: 


.macro name formal_parameter_list 


range of lines containing assembly language statements or assembler 
directives 


.endm name 


where name is any legal symbol of as many as 31 characters and the formal parameter 
list consists of symbols, separated by commas, which are to be replaced one-for-one by 
actual parameters when the macro is called or invoked. Although the macro name is 
optional on the .endm directive, the assembler provides this facility to verify and 
enforce the correct nesting of macros. 

The range of lines bracketed by .macro and .endm will be subjected to text 
substitution of the actual parameters wherever their counterpart formal parameters 
occur in any field of instructions (or directives). Each occurrence is subjected to the 
same text substitution. The process resembles the find and replace operation within a 
word processor. 

When the assembler comes upon a .macro directive, it adds the macro name to 
an internal table of macro names and stores the verbatim text of the macro body up to 
the matching .endm directive. Further processing is deferred until the macro is 
expanded when invoked. 

Macro names do not need to be unique from other symbols in a program, because 
the assembler stores them in a dedicated internal table. Similarly, the formal parameter
names are associated with a particular macro name and thus do not need to be unique
from other symbols or from the parameters for other macros. 

When macros are nested, the inner ones are not defined until the outer one is actu- 
ally invoked and expanded with text-substitution. If a macro invokes itself (recursively), 
its body must contain conditional tests and one or more .mexit directives that will 
prevent infinite recursion (limit of 1000 levels of nesting depth). 


Invoking a Macro 


A macro is used, or invoked, by putting its name in the operator field along with values 
called actual parameters in the specifier field, like this: 


name actual_parameter_list 


where the specified actual parameters will be substituted for the counterpart formal 
parameters. This substitution proceeds much like the search and replace operation with 
a word processor. 

Before presenting further details about macro parameters and the text substitution 
process, we show a simple example of a macro. Suppose that an experienced program- 
mer has been accustomed to architectures having clr as a machine instruction for 
clearing registers, i.e., ensuring that a register contains all zero bits. Although the Alpha 
architecture lacks such an instruction, we have already encountered situations in our 
sample programs where registers needed to be initialized to zero. At first we used mov 
0, Rn. Then we learned that mov is only a pseudo-instruction for which the assembler 
adapts some appropriate machine instruction. We have also discussed or used lda, 
bis, or a conditional move instruction based upon the properties of register R31 in 
each case. One such macro would be: 


        .macro  clr REG
        bis     R31,R31,REG     ; REG <-- 0
        .endm   clr

A particular invocation of this macro would be

        clr     R4

which would expand to the Alpha instruction

        bis     R31,R31,R4      ; R4 <-- 0

where the text substitution of the actual parameter R4 has occurred for the formal
parameter REG in both operand and comment fields.

We can elaborate this macro, in order to zero more than one register, by allowing 
the parameter to be a list processed by an inner indefinite repeat block: 


        .macro  clr reglist
        .irp    REG,<reglist>
        bis     R31,R31,REG     ; REG <-- 0
        .endr
        .endm   clr


The stored body of the macro will be a repeat block that takes the form 


        .irp    REG,<reglist>
        bis     R31,R31,REG     ; REG <-- 0
        .endr


When we invoke this somewhat more complicated macro as, for instance, 
        clr     <R4,R5>


the formal parameter reglist is given the text string “R4,R5” as an actual value.
The original angle brackets surrounding the actual parameter in the macro call are
removed when that actual parameter is used, but the expansion of the macro puts the
parameter list into the angle brackets which were part of the stored definition of the
macro body for the .irp directive. The repeat block will then iterate its range as its
single parameter REG takes on successive values in the specified set (first R4, and then
R5 here):


        bis     R31,R31,R4      ; R4 <-- 0
        bis     R31,R31,R5      ; R5 <-- 0


If macros or repeat blocks are nested, invoking the outermost macro or repeat block 
causes all of the inner structures to expand in accordance with parameter substitution 
and subject to any conditionals therein. 
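
The two stages of this expansion, macro substitution first and the inner repeat block second, can be traced step by step. The following Python sketch models only the list-processing clr macro from the text; the function names strip_angle and expand_clr are hypothetical.

```python
def strip_angle(actual):
    """Remove one enclosing pair of angle brackets, if present."""
    if actual.startswith("<") and actual.endswith(">"):
        return actual[1:-1]
    return actual


def expand_clr(actual):
    """Expand the list-processing clr macro for one actual parameter."""
    # Stage 1: macro substitution. The brackets of the call are removed,
    # and the text lands inside the brackets stored in the macro body.
    reglist = strip_angle(actual)
    irp_line = ".irp REG,<reglist>".replace("reglist", reglist)
    # Stage 2: the inner .irp iterates over the list it now contains.
    inside = irp_line[irp_line.index("<") + 1 : irp_line.rindex(">")]
    template = "bis R31,R31,REG ; REG <-- 0"
    return [template.replace("REG", reg) for reg in inside.split(",")]


for line in expand_clr("<R4,R5>"):
    print(line)
```

A call with a single register and no brackets, such as expand_clr("R4"), yields one line, so the elaborated macro remains usable in the same way as the original one.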


Processing of Positional Parameters 


Formal parameters in a macro definition and actual parameters in a macro call may be 
separated by commas, tabs, or spaces. The actual parameter values supplied when a 
macro is invoked are substituted strictly in accord with their positional relationships to
the corresponding formal parameters in the macro definition. The first actual parameter
replaces all occurrences of the first formal parameter, the second replaces all occurrences
of the second, etc. Null values can be passed by just using adjacent commas, like this:


definition:     .macro  count ONE, TWO, THREE
call:           count   25, , 30


In this schematic call, ONE will be text substituted with 25, THREE will be text substi- 
tuted with 30, and TWO will be text substituted with nothing. The null string must be 
acceptable to the assembler in all the positions where it “substitutes” for formal param- 
eter TWO. 

A special assembler directive is provided to count the number of positional param- 
eter values, including null values, that appear in a macro call: 


.narg CNT 


The .narg directive is legal only inside a macro definition. Suppose we wish to write 
a rather artificial macro to find the “average” of up to seven actual parameter values: 


        .macro  ave ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN
        .narg   DIV
        SUM = 0
        .irp    TERM,<ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN>
        .iif    NB TERM, SUM = SUM + TERM
        .endr
        MEAN = SUM/DIV
        .endm   ave


The indefinite repeat block permits each macro parameter to be tested, in turn, in order 
to determine whether it is non-blank (i.e., non-null) or whether it represents another 
numeric string for the sum. If an actual call is 


        ave     20, 40, , 52, , 80


the symbol MEAN will be given the value 32 = (192/6), not 48 = (192/4). 
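
The arithmetic behind this result can be checked with a short model. The following Python sketch mimics what the ave macro computes for a positional call; the function name ave and the parsing of the call text are hypothetical, integer division is assumed for the assembly-time expression, and a call with no parameters at all is not handled.

```python
def ave(call_text):
    """Model the ave macro for a purely positional call."""
    args = [arg.strip() for arg in call_text.split(",")]
    narg = len(args)  # .narg counts every position, null values included
    # .iif NB TERM, SUM = SUM + TERM : only non-blank terms are added
    total = sum(int(arg) for arg in args if arg)
    return total // narg  # MEAN = SUM/DIV


print(ave("20, 40, , 52, , 80"))  # prints 32
```

The sum of the four non-blank values is 192, but the divisor is 6 because the two null positions are counted, which is why the macro is described as rather artificial.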


Processing of Default Values and Keyword Parameters 


When a macro has numerous parameters, and especially when some of these are 
optional or usually have default values, the positional formalism becomes awkward. 
MACRO-64 provides a keyword formalism for greater convenience.

Default values are specified simply by putting “=value” after the name of any formal
parameter in a macro definition. Suppose that we modify the simple clr macro:


        .macro  clr REG=R4
        bis     R31,R31,REG     ; REG <-- 0
        .endm   clr


With this modification, invoking the macro without any explicit actual parameter,

        clr

would result in an expansion using the default value R4

        bis     R31,R31,R4      ; R4 <-- 0

while invoking the macro with any actual parameter, such as

        clr     R5

would result in an expansion using the actual value supplied:

        bis     R31,R31,R5      ; R5 <-- 0


In summary, the capability to associate a default value with one or more of the formal 
parameters when a macro is defined leads to simplification of the macro call but results 
in the same expansion as if the default value had been explicitly given at the time of the 
macro call. This feature is extensively used within the system-supplied macros such as 
$call and $routine. While the use of such defaults offers convenience and
enforces a certain level of standardization, the default values are “invisible” if one is 
only reading the macro call statement itself. 

Keyword parameters are specified by putting “=value” after the name of any formal
parameter in a macro call. Values for parameters can be given in any order when
keywords are used, like this:


definition: .macro count ONE, TWO, THREE 
call: count THREE=30, ONE=25 


In this schematic call, ONE will be text substituted with 25, THREE will be text substi- 
tuted with 30, but TWO will be text substituted with nothing. 

Keyword and positional parameters may be mixed, but this practice may reduce 
the readability of a program. Usually the first few positional parameters are mandatory,
and then numerous additional parameters may be optionally designated by keyword (in
any order) in a call. Examples include $routine and $call, where the mandatory 
first parameter (name) is usually specified in a positional fashion. 

Neither any “missing” parameters whose values are provided by default values nor 
any parameters whose values are supplied by keyword are counted by the .narg direc- 
tive, which strictly counts only positional parameters. Let us return to the previous 
example of the macro ave. If an actual call is 


ave 20, 40, FOUR=52, SIX=80 


the symbol MEAN will be given the value 96 = (192/2), not 48 = (192/4). 
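
The counting rule can be modeled outside the assembler. The following Python sketch is only an illustration and is not MACRO-64 code: it assumes, as the result above implies, that ave sums every value supplied and divides by the count reported by .narg. Treating the call as Python positional and keyword arguments is our own device for the example.

```python
def ave(*positional, **keyword):
    """Model of the ave macro: sum every supplied value, divide by .narg."""
    total = sum(positional) + sum(keyword.values())
    narg = len(positional)      # .narg counts positional parameters only
    return total // narg        # integer division, as in assembly-time arithmetic

mean = ave(20, 40, FOUR=52, SIX=80)
print(mean)   # 96, because 192 is divided by 2 rather than by 4
```

If all four values were passed positionally, the same model would divide by 4 and yield 48.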


Processing of String Parameters 


Simple strings consisting exclusively of alphanumeric characters can be passed as 
actual macro parameters without special concern. MACRO-64 preserves upper or 
lower case alphabetic characters, although assemblers for other architectures might 
convert strings to upper case by default. 

Strings which contain any of the characters that normally separate actual macro 
parameters (commas, tabs, spaces) must be enclosed within delimiters when they are to 
be passed as actual parameters to a macro. Three types of delimiters may be used: 


1. Quotation marks. The matching quotation marks are inseparably bound to the 
enclosed string as part of the actual parameter for substitution. An example of a 
macro definition, macro call, and result would be: 


Definition              Call              Result 
.macro asc STR 
  .asciz STR            asc "A bc"        .asciz "A bc" 
.endm asc 


2. Angle brackets. The angle brackets are removed from the actual parameter. There- 
fore, the macro body must be defined with inclusion of appropriate delimiters. An 
example of a macro definition, macro call, and result would be: 


Definition              Call              Result 
.macro asc STR 
  .asciz "STR"          asc <A bc>        .asciz "A bc" 
.endm asc 


3. Arbitrary delimiters. The circumflex character can be used as a unary operator to 
signal that the next character performs a quoting role. That next character can be 
anything except a, b, c, d, o, or x (see Tables 3.2 and 3.5). The delimiters are 
removed from the actual parameter. An example of a macro definition, macro call, 
and result would be: 


Definition              Call              Result 
.macro asc STR 
  .asciz "STR"          asc ^%A bc%       .asciz "A bc" 
.endm asc 


Only the left-hand delimiter is preceded by a circumflex character. 


Care must be taken when planning nested or recursive macros in order to ensure that 
exactly one set of delimiters will be present for final substitution into any .asciz 
directive. 

A special directive, .nchr, may be used in macros to determine the length of an 
actual string parameter: 


        .macro stchrs STR 
        .align quad 
        .nchr  CNT,<STR> 
        .quad  CNT 
        .ascii "STR" 
        .endm  stchrs 


When this macro is called using STCHRS <3 + 2 = 5>, the expansion will be 


        .align quad 
        .nchr  CNT,<3 + 2 = 5> 
        .quad  CNT 
        .ascii "3 + 2 = 5" 


where CNT will have the value 9 because the string parameter contains 5 visible charac- 
ters and 4 spaces. (The lines containing .align and .nchr will occur in the listing 
file only if expansions are being shown in full.) 
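
The three delimiter conventions and the character count can be summarized in a short Python model. This is a sketch under stated assumptions and not a description of the assembler's internals: the function names substituted_text and nchr are ours, and only the cases discussed above are modeled (how .nchr treats a string passed in quotation marks is not covered by the text).

```python
def substituted_text(actual):
    """Return the text that replaces the formal parameter for one actual parameter."""
    if actual.startswith("<") and actual.endswith(">"):
        return actual[1:-1]          # angle brackets are removed
    if actual.startswith("^") and len(actual) >= 3 and actual.endswith(actual[1]):
        return actual[2:-1]          # circumflex and both arbitrary delimiters are removed
    return actual                    # quotation marks stay bound to the string

def nchr(actual):
    """Model of .nchr for the bracketed case: length after delimiter removal."""
    return len(substituted_text(actual))

print(substituted_text('"A bc"'))    # "A bc"  (quotation marks kept)
print(substituted_text("<A bc>"))    # A bc
print(substituted_text("^%A bc%"))   # A bc
print(nchr("<3 + 2 = 5>"))           # 9
```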

A special operator, the backslash character (\), forces an actual parameter to be sub- 
stituted as the string of decimal digits representing its previously defined value instead 
of its name. Consider this example: 


Definition              Call              Result 
.macro lw NUM           NUM = 35 
  .long NUM             lw \NUM           .long 35 
.endm lw                NUM = -2 
                        lw \NUM           .long -2 

The backslash operator is principally used in conjunction with the operator for parame- 
ter concatenation which we introduce next. 

The apostrophe character (') has a special meaning within a macro definition 
when it adjacently precedes or follows the name of a formal parameter. During expan- 
sion of the macro, the apostrophe is removed and the actual parameter is “glued” in 
place of the formal parameter. Recall that parameters are usually separated from other 
non-substitutable text material by a comma, tab, or space. When the apostrophe charac- 
ter acts as a concatenation operator, such separators are not needed to demarcate 
parameters. 

As a first example, suppose that we wish to build an array of bytes containing the 
ASCII codes for the ten numerals and also provide a corresponding symbolic address 
for each. The following indefinite repeat block will accomplish what we want: 


        .irpc CHAR,<0123456789> 
digit'CHAR:     .byte ^a"'CHAR" 
        .endr 


because the parameter CHAR successively becomes each numeral in the order specified. 

As a second example, we can use concatenation in a reworking of the clr macro, 
where a list of register numbers is expected in this new clear macro instead of the list 
of register names expected previously: 


        .macro clear numlist 
        .irp NUM,<numlist> 
        bis R31,R31,R'NUM ; R'NUM <-- 0 
        .endr 
        .endm clear 


The stored body of the macro will be a repeat block that takes the form 


        .irp NUM,<numlist> 
        bis R31,R31,R'NUM ; R'NUM <-- 0 
        .endr 


When we invoke this macro with a list of two numbers as, for instance, 
        clear <4,5> 
the net result of the expansion will be the same as for the original version, namely, 


        bis R31,R31,R4 ; R4 <-- 0 
        bis R31,R31,R5 ; R5 <-- 0 

Although this example may seem contrived, the concatenation concept can be 
extremely powerful in the development of complex macros. 
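
To make the "gluing" behavior concrete, here is a small Python model of a repeat block with apostrophe concatenation. It is a sketch only: the function names substitute and irp are hypothetical, and the regular expression is our simplification of the assembler's rule that a formal parameter is recognized when it is set off by separators or by an adjacent apostrophe.

```python
import re

def substitute(line, formal, actual):
    """Replace a formal parameter in one line, removing adjacent apostrophes."""
    # An apostrophe directly before or after the formal name acts as the
    # concatenation operator: it disappears and the actual text is glued in.
    pattern = r"'?\b" + re.escape(formal) + r"\b'?"
    return re.sub(pattern, actual, line)

def irp(formal, actuals, body):
    """Model of an .irp repeat block: expand the body once per actual value."""
    expansion = []
    for actual in actuals:
        for line in body:
            expansion.append(substitute(line, formal, actual))
    return expansion

body = ["bis R31,R31,R'NUM ; R'NUM <-- 0"]
for line in irp("NUM", ["4", "5"], body):
    print(line)
# bis R31,R31,R4 ; R4 <-- 0
# bis R31,R31,R5 ; R5 <-- 0
```

Without the apostrophe, the text RNUM would be a single symbol and no substitution would take place, which the word-boundary test in the model imitates.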


Generating Unique Labels with Macros 


Labels within a program must be unique, in order that the assembler may refer to data 
locations or branch targets uniquely. Even the use of temporary labels would prove 
problematic in macro expansions. Since much of the utility of macros derives from 
being able to invoke any macro at multiple points in a routine, clearly any hard-coded 
labels within macros would become multiply defined and no longer unique. 

Some assemblers, including MACRO for the VAX and MACRO-64 for the 
Alpha, offer a means of assuring uniqueness of any labels defined in a macro expan- 
sion. In the macro definition, any formal parameter that begins with the question mark 
character (?) marks that parameter for use as a place-holder label throughout the macro: 


        .macro equal PAR1,?JMP 
        ldq R3,PAR1 ; Get PAR1 to test 
        bne R3,JMP 
        what to do if R3=0 
JMP:    what to do regardless of value in R3 
        .endm equal 


The recommended practice is not to supply an actual value for the parameter JMP, 
because then MACRO-64 will create a temporary label and guarantee its uniqueness. 
For instance, if equal is called twice with only an actual parameter for PAR1, 


equal DATA1 
equal DATA2 


the expansion will be of the form 


        ldq R3,DATA1 ; Get DATA1 to test 
        bne R3,30000$ 
        what to do if R3=0 
30000$: what to do regardless of value in R3 
        ldq R3,DATA2 ; Get DATA2 to test 
        bne R3,30001$ 
        what to do if R3=0 
30001$: what to do regardless of value in R3 


MACRO-64 creates temporary labels in the range 30000$ to 65535$. Consequently, 
you should refrain from using this range for hand-coded temporary labels. 

If the programmer supplies an actual value for a formal parameter that begins with 
a question mark when calling a macro such as equal, then MACRO-64 will use that 
value: 


        equal DATA3,400$ 


in the resulting expansion 


        ldq R3,DATA3 ; Get DATA3 to test 
        bne R3,400$ 
        what to do if R3=0 
400$:   what to do regardless of value in R3 


The burden for assuring uniqueness then rests with the programmer, who still obtains 
all of the other benefits of macro use. 
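
The label-creation rule can be imitated with a simple counter. The Python sketch below is a model under stated assumptions, not the assembler's actual mechanism: the class LabelMaker and the function expand_equal are hypothetical names, and we assume that created labels are handed out consecutively starting at 30000$, as the expansions above suggest.

```python
class LabelMaker:
    """Hands out temporary labels 30000$, 30001$, ... in order."""
    def __init__(self, first=30000, last=65535):
        self._numbers = iter(range(first, last + 1))

    def next_label(self):
        return f"{next(self._numbers)}$"

def expand_equal(labels, par1, jmp=None):
    """Expand the equal macro; create a label only when the caller gives none."""
    if jmp is None:
        jmp = labels.next_label()
    return [
        f"ldq R3,{par1}",
        f"bne R3,{jmp}",
        "; what to do if R3=0",
        f"{jmp}: ; what to do regardless of value in R3",
    ]

labels = LabelMaker()
for call in (("DATA1",), ("DATA2",), ("DATA3", "400$")):
    print("\n".join(expand_equal(labels, *call)))
```

The third call supplies its own label, so the counter is not advanced and uniqueness becomes the caller's responsibility.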


Self-Redefining and Recursive Macros, .mdelete, and .mexit 


Macros may redefine themselves. In recursive instances, the macro must contain a con- 
ditional directive that will terminate the recursion. 

When a macro is redefined (or when it redefines itself), the previous definition is 
supplanted by the current one. The special directive, .mdelete NAME, removes the 
macro called NAME from storage entirely. 

Macros may invoke themselves recursively. Consider the factorial function, imple- 
mented as a recursive macro: 


        .macro factorial INT 
        .if EQUAL <INT>,0 ; Test for end of recursion 
        F = 1 ; using 0! = 1 
        .else 
        factorial <INT - 1> ; Call itself recursively 
        F = F * <INT> ; using N! = N * (N-1)! 
        .endc 
        .endm factorial 


We encourage you to trace the effect of invoking this macro with a small value of INT, 
such as 4. 
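
As a check on such a hand trace, the recursion can be mimicked in Python. This is only a sketch: the function below evaluates its argument numerically at each level, whereas the assembler passes the unevaluated text (for example, 4 - 1) into the nested expansion. The trace format is our own invention.

```python
def factorial(n, depth=0, trace=None):
    """Model of the recursive factorial macro; returns (F, trace lines)."""
    if trace is None:
        trace = []
    indent = "  " * depth
    trace.append(f"{indent}factorial <{n}>")
    if n == 0:                                      # .if EQUAL <INT>,0
        f = 1                                       # F = 1
    else:
        f, _ = factorial(n - 1, depth + 1, trace)   # nested expansion
        f = f * n                                   # F = F * <INT>
    trace.append(f"{indent}F = {f}")
    return f, trace

value, lines = factorial(4)
print("\n".join(lines))
print(value)   # 24
```

The indentation in the printed trace shows that the symbol F is first assigned at the deepest level (INT equal to 0) and is then multiplied on the way back out.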

The .mexit directive can be used to exit unconditionally from further expansion 
of a macro. The .mexit directive exits just one level of macro expansion (for nested 
macros). This directive is useful in implementing case structures (i.e., mutually exclu- 
sive conditions): 


        .macro choice WHEN 
        .if GT <WHEN>,1984 
        ; case for FUTURE 
        .mexit 
        .endc 
        .if EQ <WHEN>,1984 
        ; case for THAT YEAR 
        .mexit 
        .endc 
        .if EQ <WHEN>,1983 
        ; case for YEAR BEFORE 
        .mexit 
        .endc 
        .if LT <WHEN>,1983 
        ; case for EARLIER 
        .mexit 
        .endc 
        .endm choice 


You should ask yourself why the parameter WHEN is enclosed in angle brackets wher- 
ever it is used in the body of this macro. (Hint: Does an actual parameter have to be a 
“simple” value?) 


Controlling the Listing File 


The MACRO-64 assembler provides several directives that afford the programmer a 
certain amount of control over the contents and appearance of the listing file: 


.page                 inserts a form feed character and advances the page 
                      number in the listing file. 

.title name phrase    gives the program module a name and a title phrase, 
                      which is reproduced at the top of each page in the 
                      listing file. 

.sbttl phrase         defines a subtitle phrase, which is reproduced beneath 
                      the title line at the top of each page in the listing file. 

.show option(s)       enables a level of detail of particular components and 
                      macro expansion to be shown in the listing file. 

.noshow option(s)     suppresses particular components from being shown 
                      in the listing file. 

.list                 is a synonym for .show (PDP-11 and VAX heritage). 
.nlist                is a synonym for .noshow (PDP-11 and VAX heritage). 


These listing control directives have no functional effect upon the contents of the 
machine-code object file. 

The .show and .noshow directives may contain one or more of the symbolic 
options in Table 9.2. By default, each of these options is inactive (i.e., set to “noshow”). 


Table 9.2 Options for .show and .noshow Directives 

Long Form      Short Form   Function 

binary         meb          Lists the expansions only of those lines in macros and 
                            repeat blocks that generate binary code (or data), i.e., those 
                            lines that advance a location counter. The binary option 
                            is a subset of the expansions option. 

conditionals   cnd          Lists unsatisfied conditional sub-blocks associated with 
                            .if blocks. 

expansions     me           Lists the complete expansions of macros and repeat 
                            blocks. 

library        (none)       Lists the macro definitions coming from a library. 

include        (none)       Lists the text coming from an include file. 


When the .show and .noshow directives are used without options, they incre- 
ment and decrement a “listing level” counter that is initially zero. When this counter is 
negative, lines are suppressed from the listing file (except for lines that generate an 
error). When the counter is zero, lines are either shown or suppressed depending on the 
prevailing .show/.noshow options. When the counter is positive, lines are uncondi- 
tionally shown. Consider this example: 


        .noshow ; Counter < 0 
        ; These lines are suppressed from the listing file 
        .show ; Counter = 0 
        ; These lines display according to .show options 
        .show ; Counter > 0 
        ; These lines always display regardless of .show options 


Macros may contain .noshow at the beginning to decrement the listing level counter 
and .show at the end to restore the listing level counter to its original value. In this 
way, the expansion of complex macros may be listed selectively. 
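
The counter rule is easy to tabulate. The following Python sketch is a model only: the names listing_state and walk are hypothetical, the directives are assumed to appear without options, and the exception for lines that generate an error is not modeled.

```python
def listing_state(counter):
    """Describe how lines are treated for a given listing-level counter."""
    if counter < 0:
        return "suppressed"
    if counter > 0:
        return "always shown"
    return "shown according to options"

def walk(directives):
    """Apply option-less .show/.noshow directives, starting from zero."""
    counter = 0
    history = []
    for directive in directives:
        counter += 1 if directive == ".show" else -1   # .noshow decrements
        history.append((directive, counter, listing_state(counter)))
    return history

for step in walk([".noshow", ".show", ".show"]):
    print(step)
```

A macro that begins with .noshow and ends with .show leaves the counter where it found it, which is why the pairing restores the caller's listing behavior.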


Defining Additional Program Sections with .psect 


We have previously seen that the system-supplied $routine macro (discussed in 
Chapter 7) defines three program sections named $code$, $data$, and $link$. 
When we look for these on a link map, we can see that they have different attributes. 
The code section is marked with the attributes EXE and NOWRT, the data section is 
marked with NOEXE and WRT, and the linkage section is marked NOEXE and NOWRT. 

The program sectioning directive .psect can be used to set up additional pro- 
gram sections: 


        .psect name, alignment, attribute(s) 


Every program section has its own location counter value that the assembler maintains 
and associates with a distinct address origin corresponding to the program section 
name. The names of program sections, like the names of macros, are stored in a unique 
way within the symbol tables during the assembly process and may thus duplicate 
macro names or symbols. 

Additionally, the .psect directive can be used to switch the active location 
counter to that of a different previously defined program section: 


        .psect earlier 
        ; code or data located relative to 'earlier' origin 
        .psect another 
        ; code or data located relative to 'another' origin 


The binary object file produced by the assembler contains sufficient information to permit 
the linker to collect and organize all the fragments belonging to each program section. 

The alignment of a program section can be specified as BYTE, WORD, LONG, 
QUAD (default), or OCTA in order to instruct the linker to align the origin of the section 
at the next available virtual address that is evenly divisible by 1, 2, 4, 8, or 16, respec- 
tively. The system-supplied $routine macro uses OCTA alignment for the three stan- 
dard program sections. Generally a large value of the alignment parameter should be 
used because in most implementations that will give optimal performance as the system 
accesses instructions and data. The alignment value specified in the definition of a pro- 
gram section sets an upper bound for any .align directive (Chapter 7) that may occur 
in the section. 
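
The arithmetic behind "the next available virtual address that is evenly divisible" is a simple rounding-up operation. The Python sketch below illustrates it; the function name align_origin and the sample address 0x1003 are our own choices for the example.

```python
ALIGNMENT = {"BYTE": 1, "WORD": 2, "LONG": 4, "QUAD": 8, "OCTA": 16}

def align_origin(next_free_address, alignment="QUAD"):
    """Return the lowest address >= next_free_address divisible by the alignment."""
    size = ALIGNMENT[alignment]
    return (next_free_address + size - 1) // size * size

print(hex(align_origin(0x1003, "BYTE")))   # 0x1003
print(hex(align_origin(0x1003, "QUAD")))   # 0x1008
print(hex(align_origin(0x1003, "OCTA")))   # 0x1010
```

Larger alignments may therefore leave a few unused bytes before a section's origin, which is the price paid for the better access performance mentioned above.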

The capability to specify numerous program sections has been provided in the 
assembler for several reasons: 


* to support the development of modular, more easily maintained programs; 
* to separate instructions from data; 
* to protect instructions and read-only data from being modified; 
* to prevent any attempted fetching of data as instructions; and 
* to allow sharing of instructions (e.g., memory-resident shareable code libraries) or 
  data in advanced applications. 


Moreover, some high-level languages (e.g., FORTRAN) make use of program sections 
for various “named” storage allocation regions. 

These capabilities require the appropriate specification of several attributes for 
each program section, as summarized in Table 9.3. 


Table 9.3 Attributes for Program Sections 

Attribute       Opposite          Function 

EXE (default)   NOEXE             Executable section containing only instructions (EXE); or 
                                  nonexecutable section containing only data (NOEXE). 

MIX             NOMIX (default)   Mixed section that can contain both instructions and data (MIX); 
                                  or nonmixed section containing either instructions or data, but not 
                                  both (NOMIX). 

WRT (default)   NOWRT             Writable section that can be altered at execution time (WRT); or 
                                  nonwritable section that cannot be so altered (NOWRT). 

RD (default)    NORD              Readable or nonreadable section (reserved for future use by 
                                  Digital Equipment Corporation). 

REL (default)   ABS               Relocatable section which can contain instructions or data at 
                                  successive locations relative to a relocatable base address associated 
                                  with the section name (REL); or the so-called absolute section 
                                  which never contains instructions or data but is a device used within 
                                  some system macros to define certain symbolic constants (ABS). 

LCL (default)   GBL               Local section most frequently used (LCL); or global section used 
                                  by FORTRAN for named COMMON blocks (GBL). 

SHR             NOSHR (default)   Shareable section (SHR) which can be shared at execution time by 
                                  multiple processes (not discussed in this book). 

PIC             NOPIC (default)   Position-independent content section (PIC), which applies only to 
                                  the construction of shareable images (not discussed in this book). 

CON (default)   OVR               Concatenated section into which the linker will collect contents 
                                  from multiple program modules and arrange them end-to-end in 
                                  the order encountered (CON); or overlaid section into which the 
                                  linker will restart laying contents from multiple program modules 
                                  starting from the same base address for each (OVR). 

For the CON attribute, the highest allocated virtual address for a section is the sum 
of the individual allocations requested in all contributing modules. For the OVR 
attribute, the highest allocated virtual address for a section is the largest allocation 
requested by any one contributing module. Digital BASIC uses the OVR attribute for its 
MAP statement, which may occur several times in order to provide multiple symbolic 
offsets that refer to the same actual data storage locations, e.g., a block of storage to be 
accessible as either a 512-element byte array or as a 64-element quadword array. 
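
The difference between the two attributes amounts to "sum" versus "maximum." The Python sketch below models only that rule; the function name section_size is hypothetical, and any padding the linker may insert for alignment between contributions is ignored.

```python
def section_size(allocations, attribute="CON"):
    """Bytes reserved for one section, given each module's requested allocation."""
    if attribute == "CON":
        return sum(allocations)      # contributions are laid end-to-end
    if attribute == "OVR":
        return max(allocations)      # contributions share one base address
    raise ValueError(f"unknown attribute: {attribute}")

# Two views of the same storage, as in the BASIC MAP example:
byte_view = 512 * 1      # 512-element byte array
quad_view = 64 * 8       # 64-element quadword array
print(section_size([byte_view, quad_view], "OVR"))   # 512
print(section_size([byte_view, quad_view], "CON"))   # 1024
```

With OVR the two views occupy the same 512 bytes; with CON they would be distinct regions totaling 1024 bytes.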

In a mixed-language programming environment, the linker will enforce that a 
given program section has been specified with the same attributes in all modules where 
it is used. For example, if we wish to refer to a Pascal array in some other routine, we 
must match the attributes set by the high-level language for the pertinent program sec- 
tion which allocated the array. 


MONEY: a Macro Illustrating Program Sections 


We mentioned earlier in this chapter that macros can be used to create data structures 
and to populate them with values statically. Most real-world programs contain a lot of 
never-altered internal data such as captions, help systems, and error messages. Often 
such data are most conveniently accessed if they are organized as elements in arrays or 
more complex data structures. In addition, such data may need to be reformulated into 
various natural languages in support of program versions to be used internationally. 
High-level languages have come to provide some capability for populating such struc- 
tures statically, i.e., at compile time rather than at run time. One example is the VALUE 
section in a program written in Digital Equipment Corporation’s extension of the Pascal 
language. We now present another approach that could package such variant data as 
“parallel arrays” in interrelated program sections using the assembler. 

Consider a hypothetical company that wishes to develop its check-writing soft- 
ware with an eye toward support of multiple currencies and natural languages. It is rela- 
tively easy to change from the n,nnn.nn convention for writing values with numerals in 
English-speaking countries to the n.nnn,nn convention in German-speaking countries; 
indeed, current versions of spreadsheet programs like Microsoft Excel do this with an 
easily changed setting. By convention, however, a check also contains the amount 
“spelled out” in words: 


One thousand four hundred twenty two and 37/100 US dollars 
or 
One thousand four hundred twenty two and 37/100 UK pounds 


Let us outline a universal algorithm (requiring a country-specific static data structure) 
for constructing such phrases. 

Essentially, solving this problem proceeds in the reverse of the way we first 
learned to count. We will start with some suitable largest value (here, 1000 for simplic- 
ity). Imagine that the following data structure somehow already exists: 


valu[ ]     text[ ]     phrase 
1000        o-->        one thousand 
900         o-->        nine hundred 
20          o-->        twenty 
19          o-->        nineteen 
2           o-->        two 
1           o-->        one 


where the text[ ] column contains address pointers. It is not difficult to propose 
having corresponding populated data structures for French, German, etc., although the 
number of rows might differ for each because most every language “counts” idiosyn- 
cratically. 

Suppose we are given an amount (i.e., a number), a unit of currency, and the 
word and in the natural language. The following algorithm written in a make-believe 
dialect of BASIC will suffice: 


rest = amount 
FOR i = 0 STEP +1 UNTIL rest < 1 
    IF valu[i] <= rest 
        THEN PRINT AS STRING @text[i] 
             rest = rest - valu[i] 
    END IF 
NEXT i 
PRINT AS STRING and 
PRINT AS FRACTION rest 
PRINT AS STRING SINGULAR_PLURAL( unit, INT(amount) ) 


We ignore small details such as capitalization, commas, and spaces. The algorithm sim- 
ply passes down through the tabular data structure using successive subtraction. The 
“@” symbol implies that array text [ ] contains addresses, not the strings themselves. 
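
For readers who would like to run the algorithm, here is a Python rendering. It is a sketch under several assumptions of ours: the amount is supplied as an integer number of hundredths so that no floating-point rounding intrudes, the English table is generated by a helper named build_table rather than written out row by row, the table covers amounts up to 1999 only, and the SINGULAR_PLURAL step is omitted.

```python
ONES = ("one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

def build_table():
    """Return the parallel lists valu[] and text[] in descending order of value."""
    rows = [(1000, "one thousand")]
    rows += [(100 * n, ONES[n - 1] + " hundred") for n in range(9, 0, -1)]
    rows += [(10 * n, TENS[n - 2]) for n in range(9, 1, -1)]
    rows += [(n, ONES[n - 1]) for n in range(19, 0, -1)]
    return [v for v, _ in rows], [t for _, t in rows]

def spell_amount(hundredths, unit, valu, text, and_word="and"):
    """Spell out an amount by passing once down the table with subtraction."""
    rest, fraction = divmod(hundredths, 100)   # whole units and hundredths
    words = []
    i = 0
    while rest >= 1:                           # FOR i ... UNTIL rest < 1
        if valu[i] <= rest:
            words.append(text[i])              # PRINT AS STRING @text[i]
            rest -= valu[i]
        i += 1
    words += [and_word, f"{fraction:02d}/100", unit]
    return " ".join(words)

valu, text = build_table()
print(spell_amount(142237, "US dollars", valu, text))
# one thousand four hundred twenty two and 37/100 US dollars
```

Only the table and the word passed as and_word depend on the natural language; the loop itself is unchanged from one language to the next, which is the point of keeping the data in a separate static structure.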

How can the required data structure with parallel arrays valu[ ] and text[ ] 
be constructed in a language-independent way? The values will be, say, quadwords. The 
address pointers must also be quadwords on an Alpha. The .asciz directive might be 
appropriate for the phrases, since we may suppose that the terminating null byte can be 
used by PRINT AS STRING to detect the end of any phrase being processed. We intro- 
duce a macro that references three program sections: 

        .macro money AMT,PHRASE 
        .psect WORDS 
$TAG$ = .                       ; Current location counter 
        .asciz /PHRASE/         ; Null byte at end 
        .psect VALUES 
        .quad AMT               ; Numerical amount 
        .psect WHERE 
        .address $TAG$ 
        .endm money 


If this macro is invoked over and over, the reusable symbol $TAG$ will have a different 
magnitude each time, namely, the particular address offset from the beginning of pro- 
gram section WORDS where the first character of each corresponding counting-scheme 
phrase has been put. The linker will add a base address to all such offsets when it estab- 
lishes a final virtual address for every program section. We thus produce three parallel 
arrays: quadwords, address pointers, and variable-length strings. 
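
The way repeated invocations build the three parallel arrays can be simulated in Python. The class below is a toy model and not a description of the object file format: the name Sections and its methods are hypothetical, and the stored "addresses" are offsets from the start of WORDS, to which the linker would later add a base address.

```python
class Sections:
    """Toy model of the three program sections filled by the money macro."""
    def __init__(self):
        self.words = bytearray()   # WORDS: null-terminated phrases
        self.values = []           # VALUES: one quadword per amount
        self.where = []            # WHERE: one address (offset) per phrase

    def money(self, amt, phrase):
        tag = len(self.words)                           # $TAG$ = .
        self.words += phrase.encode("ascii") + b"\0"    # .asciz /PHRASE/
        self.values.append(amt)                         # .quad AMT
        self.where.append(tag)                          # .address $TAG$

    def phrase_at(self, index):
        """Follow the pointer in WHERE and read up to the null byte."""
        start = self.where[index]
        end = self.words.index(b"\0", start)
        return self.words[start:end].decode("ascii")

s = Sections()
s.money(1000, "one thousand ")
s.money(900, "nine hundred ")
s.money(1, "one ")
print(s.values)         # [1000, 900, 1]
print(s.where)          # [0, 14, 28]
print(s.phrase_at(1))   # nine hundred
```

Each phrase occupies its own length plus one null byte, so the offsets in WHERE advance irregularly even though VALUES and WHERE themselves hold fixed-size elements.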


        .title USDOLLARS 
        .include "MONEY.M64"    ; MONEY macro definition 

; Define program sections 
        .psect WORDS,BYTE,NOWRT,NOEXE 
        .psect VALUES,QUAD,NOWRT,NOEXE 
Valu: 
        .psect WHERE,QUAD,NOWRT,NOEXE 

; Define all the phrases 
; (insert new ones at top if monetary inflation occurs) 
        money 1000,<one thousand > 
        money 900,<nine hundred > 
        ; etc. 
        money 100,<one hundred > 
        money 90,<ninety > 
        ; etc. 
        money 20,<twenty > 
        money 19,<nineteen > 
        ; etc. 
        money 2,<two > 
        money 1,<one > 
        .end 


The order of occurrence of the various invocations of the money macro, specifying 
ever-smaller amounts, produces the ordered data structure. The resulting object module 
would be linked with the main program, the latter being written in any convenient pro- 
gramming language. 


Summary 


Assembler programs have evolved to a point where numerous conveniences for the pro- 
grammer have been added to the indispensable primary functions of formulating 
machine instructions from mnemonic statements and interrelating symbolic addresses 
for both instructions and data. MACRO-64 exemplifies a rather typical suite of capabil- 
ities that encompasses repeat blocks, conditional assembly, and macros. When these are 
intermixed and nested, quite exquisite control may be exercised over the details in an 
assembly language program. 

In particular, greater control over storage allocation and establishment of interrela- 
tionships among data is both necessary and possible at the assembly-language level as 
contrasted to the high-level language level. We have emphasized this point through a 
discussion of program sections. 


EXERCISES 


9.1 Extend the SQUARES program in Chapter 1 in order to compute a list of values of 
one of the following polynomials for integer values of N from 1 through 50, again 
without using explicit multiplication instructions. Hint: Use a repeat block. 


a. N^2 + N 
b. 2N^2 - 1 
c. N^2 - N + 2 


Store the function values in memory locations. Use the examine command of the 
debugger to inspect the results. 


9.2 Use a repeat block to construct a data section containing 25 instances of a record 
structure that provides 5 quadwords for a part number, a quantity, a cost, a selling 
price, and a net profit and then 40 bytes for a description. Note that the resulting 
memory region comprises an array having one row per part and with six columns for 
the specified fields of information. 


9.3 Use a repeat block with .irp to construct a data section containing five successive 
25-element vectors labeled NUMBER, COUNT, COST, PRICE, and PROFIT. Then 
allocate a space at label NAME for 25 strings of 40 bytes in length. Draw a sketch that 
shows how this data region differs from the one which would result from exercise 9.2 
above. 


9.4 Show how the .irpc example for vowels could be modified to determine that a 
character is not a vowel. Add further instructions outside the repeat block so that the 
entire instruction sequence will lead to AHEAD with the value 1 in register R1 if the 
character is a consonant, the value 0 if it is a vowel, and the value -1 otherwise. 


9.5 Distinguish among the multiple meanings of the word macro: name of a DCL com- 
mand, name of a system-supplied program, name of an assembly language, name of 
a prearranged but parameterized sequence of instructions. 


9.6 The VAX instruction set contains both 3-operand arithmetic instructions for long- 
words, which are identical in syntax to the Alpha instructions for longwords, and 2- 
operand arithmetic instructions that overwrite the second operand with the result. 
Write macros named like the VAX instructions which would emulate these instruc- 
tions on the Alpha for operands in registers. (The actual VAX instructions can also 
use operands in memory via numerous addressing modes, but such situations would 
require sequences of instructions on a RISC processor.) 


a. addl3 and addl2 
b. subl3 and subl2 
c. mull3 and mull2 


9.7 Write simple macros for the following “missing instructions” in the Alpha instruc- 
tion set: 


a. NEGate Rx,Ry (i.e., Ry = two’s complement of an integer in Rx). 
b. COMplement Rx,Ry (i.e., Ry = one’s complement of an integer in Rx). 
c. ABSolute_value Rx,Ry (i.e., Ry = absolute value of an integer in Rx). 
Does one of these require a macro-generated label? 

9.8 Write a macro containing the .narg and .irp directives, as follows: 


a. PUSHREGS takes a parameter list of integer register names whose contents are to 
be preserved on the stack. A single lda instruction should reserve stack space. 
An internal symbol initialized to zero can be used to store the data at successive 
offsets (0, 8, etc.) from the predecremented SP register. 


b. POPREGS macro that would be the complement of PUSHREGS. 


9.9 Explain how the macro 


        .macro doit ARITH,A,B,LENGTH=Q 
        ARITH'LENGTH A,B,B 
        .endm doit 

will expand in the following cases of invocation: 


a. doit sub R1,R2,L 
b. doit mul R1,R2 
c. doit add R1,R20 


9.10 Explain the difference between these invocations of a macro: 
a. load Data 
b. load \Data 
given Data = 6 and the following macro definition: 
.macro load Value 
ldq R0,Base'Value 
.endm load 
What symbol(s) must be defined elsewhere in the program for parts a and b? 
9.11 Write simple macros for swapping integers, as follows: 


a. Macro SWAP assumes that the two register parameters contain the quadword 
data. 


b. Macro SWAPI (swap indirect) assumes that the two register parameters point to 
the quadword data. 


Assume that register R22 can be used for scratch purposes. 
9.12 Write out all the lines needed to alter USDOLLARS appropriately for: 
a. up to 1999 UK pounds 
b. up to 1999 French francs 
c. up to 1999 Deutschmarks 
d. up to 1999 units of some other country’s currency 


9.13 Develop an algorithm, a data structure, and a macro that would support the writing of 
numbers as Roman numerals. If your instructor so requires, pursue this exercise as a 
complete program that prompts for a value and prints it as an upper or lower case 
Roman numeral. 





CHAPTER 10 


Input and Output of Text 


The organization of a complete computer system 
includes some components dedicated to input and output of data. Such I/O components 
are connected to the central processor and memory by bus structures (see Figure 2.1). 
Although this book concentrates on the architecture of the CPU itself and the instruc- 
tion set, we include in this chapter a treatment of input and output from a programmer’s 
perspective at a relatively high level which leaves consideration of the physical devices 
(including bus protocols and timing) to books on computer organization, detailed expla- 
nation of file systems (including directories and security) to books on operating sys- 
tems, and description of internal file organizations (e.g., sequential or indexed) to books 
on data structures and database technology. 


The number and nature of commands required to control the operation of each 
type of physical device vary considerably, whether for non-storage devices like displays 
and printers or for storage devices like tape transports and disk drives. A magnetic tape 
unit needs commands to start, stop, and reverse the motion of the tape as well as to read 
or write data. The layout of data on a tape is usually organized sequentially in blocks 
comprising hundreds to thousands of bytes. 


A disk unit might be designed to keep spinning all the time, although start and 
stop commands are supported for disk devices in a notebook computer and all types of 
removable disks. The layout of data on a disk is usually organized along concentric cir- 
cles or spiral pathways at the fundamental physical level, with some appreciable frac- 
tion of the total storage capacity devoted to error correcting codes alongside the data 
proper. With RAID technology (redundant arrays of independent devices), data belong- 
ing to a single file may be actually stored across several distinct hardware elements. The 
hardware or firmware of the disk unit or array is designed to hide the detection and cor- 
rection of errors and thus present to an operating system the abstraction of an idealized 
storage space upon which a file system can be reliably based. 

In turn, the file system and other aspects of a programming environment establish 
higher levels of abstraction (Figure 10.1) where a program can create or access a data 
file without concern about how and where it is physically stored. Most operating sys- 
tems manage their file systems in ways that permit large files to be broken up into seg- 
ments that are not necessarily stored contiguously. A logical file is always intact, 
however, and access to it by a program can proceed without regard for underlying dis- 
continuities. 


File system layer(s) of operating system 
    Logical file (uniform address space) 

File implementation layer(s) of operating system 
    Directory file 
    Physical file (perhaps not stored contiguously) 

Device drivers and hardware controllers 
    Raw storage (including error correction overhead) 


Figure 10.1 Schematic relationship between logical and physical file storage 


In this chapter, we present enough detail about the interface to some of the func- 
tions in the standard C library to make possible the writing of complete programs in 
Alpha assembly language that perform sequential input and output using both the “ter- 
minal” and text files. 


File Systems 


Early operating systems had file systems that extended only minimally the level of 
organization of data exhibited by trays of punched cards or by reels of magnetic tape. 
Not so now. DOS, a quite simple single-user operating system, supports subdirectories 
and a consistent, albeit very limiting, convention for naming files stored on disk. 
Windows 95/98, Windows NT, and the Macintosh operating system all support longer 
file names in a manner that is case-preserving for display but case-insensitive for 
purposes of searches and enforcement of uniqueness. These systems provide ways to 
view the cataloguing of files by directory and subdirectory using either text or 
graphics (i.e., folders). 
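
To make "case-preserving but case-insensitive" concrete, here is a minimal C sketch. The function name `same_name_ignoring_case` is hypothetical, and a real file system would also have to deal with character sets beyond ASCII. The stored name keeps whatever capitalization the user typed, while the uniqueness test ignores it.

```c
#include <ctype.h>

/* Return 1 if two file names are the same when letter case is ignored,
   otherwise 0. The names themselves are left untouched, so the original
   capitalization is still available for display. */
int same_name_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both names end together */
}
```

Under this rule a directory could not hold both ReadMe.TXT and readme.txt, but whichever was created first would be listed exactly as typed.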

Operating systems intended for shared use extend those concepts with some 
means of designating individual and/or group “ownership” of directories and files, with 
a matrix of permissions that the system can match against the identity of the person, 
process, or program seeking access to particular information. Names of directories and 
files in OpenVMS are neither case-sensitive nor case-preserving. Names of directories 
and files in Unix are both case-sensitive and case-preserving. Both of these operating 
systems store ownership and access protection information in the directory data 
attached to every file (and to every directory, since directories are themselves 
implemented as special types of files). In a networked environment, data on both Windows 
95/98/NT and Macintosh systems can also be marked for selective sharing on a 
case-by-case basis. Some operating systems can support shared “update” access whereby 
more than one process can modify a file using algorithms that make explicit any 
conflicts or ambiguities arising from near-simultaneous updates. 
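
The matching of a permission matrix against the identity of a requester can be sketched as follows. This is a simplified model loosely patterned on the Unix owner/group/other scheme; the structure `file_info`, the function `access_allowed`, and the numeric user and group identifiers are assumptions made for illustration only.

```c
/* Permission bits within one three-bit group (Unix-style). */
enum { PERM_READ = 4, PERM_WRITE = 2, PERM_EXEC = 1 };

/* Hypothetical directory data kept for each file. */
struct file_info {
    int      owner;   /* identifier of the owning user  */
    int      group;   /* identifier of the owning group */
    unsigned mode;    /* nine bits: owner, group, other */
};

/* Return 1 if the requester may perform every operation in `wanted`. */
int access_allowed(const struct file_info *f, int user, int user_group,
                   int wanted)
{
    unsigned bits;

    if (user == f->owner)
        bits = (f->mode >> 6) & 7;      /* owner's row of the matrix */
    else if (user_group == f->group)
        bits = (f->mode >> 3) & 7;      /* group's row               */
    else
        bits = f->mode & 7;             /* everyone else             */
    return (bits & (unsigned)wanted) == (unsigned)wanted;
}
```

With a mode of octal 0640, for example, the owner may read and write, members of the group may only read, and all other users are refused.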

Multi-user operating systems also typically offer to the system manager some way 
to limit the absolute amount of storage claimed by any one user. In such cases, the stor- 
age taken up by directories themselves and any hidden pointers required to catalog, 
point to, or link the disjoint portions of physically fragmented files are all tallied against 
the owner’s storage quota. A file system has to maintain a dynamic, comprehensive 
inventory of space allocation, marking regions of the disk as used when files grow and 
putting such regions back into some form of a free list when files are deleted. 

When a new file is created, many pieces of information may have to be specified 
(if system defaults are not appropriate), such as file name, owner, protection, access 
mode, or internal file organization. The system will have to apply and record such infor- 
mation, as well as assure access to the amount of actual data storage requested either 
up-front or as-needed. 

Clearly the software components required to support a full-featured file system 
are not simple. Moreover, such system software is usually defined and provided in 
“layers” with increasing levels of abstraction, but often with some ability for programmers 
to drill down and use low-level routines if they wish. Figure 10.2 depicts the situation 
for programs we might write in Alpha assembly language in the Unix or OpenVMS 
programming environments. 


Unix programming environment (top to bottom):

    Program in Unix programming environment
    C standard I/O library functions
    Unix I/O system calls (creat, open, etc.)
    Logical data storage and device control
    Physical data storage and device control

OpenVMS programming environment (top to bottom):

    Program in OpenVMS programming environment
    C standard I/O (decc$fopen, etc.) alongside emulated Unix I/O (creat, etc.)
    Record management services (RMS macros)
    OpenVMS system services (QIO macros: IO$_create, IO$_access, etc.)
    Logical data storage and device control
    Physical data storage and device control


Figure 10.2 File system layers seen by a C-like program on an Alpha 


Unix I/O Software 


For a Unix system, the appropriate device controllers and the requisite components in 
the operating system kernel will manage physical data storage. Other parts of the oper- 
ating system will regulate access to storage devices and manage data storage at a logical 
level. That is, a file will logically correspond to an idealized address space capable of 
growing to as many kilo-, mega-, or gigabytes as may prove to be necessary. 

The Unix environment defines certain I/O system calls such as creat and open 
for establishing and accessing files or read and write for moving bytes of informa- 
tion from and into files. Any particular Unix implementation may make such system 
calls available to a programmer using assembly language or a high-level language like 
C, and an arrow in Figure 10.2 indicates this additional type of access to be possible for 
Digital Unix on an Alpha. 

The C programming language, in the purest sense, was defined apart from consid- 
erations of input and output (this was true for the strictest definition of Pascal also). 
Nevertheless, the definition of the C language has been accompanied by a library defin- 
ing standard functions such as fopen for establishing and accessing files or fgets 
and fputs for moving bytes of information from and into files. Any particular Unix 
implementation may make such library functions available to a programmer using assembly 
language, and an arrow in Figure 10.2 indicates this additional type of access to be 
possible for Digital Unix on an Alpha. 
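
The two levels of access just described can be compared side by side in C. The sketch below assumes a POSIX-style environment that supplies `<fcntl.h>` and `<unistd.h>`; the function names `write_with_syscalls` and `read_line_with_stdio` are hypothetical. The first function uses the Unix I/O system calls directly, and the second uses the C standard library.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Create a file and write text using Unix I/O system calls.
   Returns 0 on success or -1 on any failure. */
int write_with_syscalls(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t put = write(fd, text, len);      /* moves raw bytes */
    if (close(fd) != 0)
        return -1;
    return (put == (ssize_t)len) ? 0 : -1;
}

/* Read the first line of a file using the C standard library.
   Returns 0 on success or -1 on any failure. */
int read_line_with_stdio(const char *path, char *buf, int size)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    char *got = fgets(buf, size, fp);        /* keeps the newline */
    fclose(fp);
    return (got != NULL) ? 0 : -1;
}
```

Both routes reach the same file. The library functions add buffering and a line-oriented view on top of the byte-moving system calls.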


OpenVMS I/O Software 


For an OpenVMS system, whether running on VAX (32-bit) or Alpha (64-bit) hard- 
ware, the appropriate device controllers and the requisite components in the operating 
system kernel will manage physical data storage. Other parts of the operating system 
will regulate access to storage devices and manage data storage at a logical level. That 
is, a file will logically correspond to an idealized address space capable of growing to as 
many kilo-, mega-, or gigabytes as may prove to be necessary. 

The OpenVMS environment defines “native” system services for queued input 
and output (QIO), which operate in a very sophisticated manner based upon precisely 
defined data structures through which a program submits requests that the kernel of the 
operating system then manages dynamically. 

The programming interface to this I/O system primarily takes the form of a single 
extraordinarily complex system-supplied macro called $QIO that contains numerous 
embedded defaults and consistency cross-checks to ensure that the required data struc- 
tures are built correctly. Individual operations are specified by means of a function 
parameter to the $QIO macro, such as IO$_create and IO$_access for establishing 
and accessing files or IO$_readx and IO$_writex for moving bytes of information 
from and into files (where x can specify physical or logical blocks of data). The 
QIO access methods are available to a programmer using assembly language or a high- 
level language like C, and an arrow in Figure 10.2 indicates this additional type of 
access to be possible for OpenVMS on an Alpha. 


Prior to the development of relational database technology, the more highly 
evolved operating systems in the computer industry had come to offer certain file 
types with a considerable measure of internal organization. A notable example is 
what IBM called the indexed sequential access method (ISAM files), where an index 
field permits the file to be accessed “sequentially” in the indexed order. Similarly, 
Digital Equipment Corporation developed record management services (RMS files) 
for two of the 16-bit PDP-11 operating systems (RSX-11M and RSTS/E) and for the 
32-bit VAX and 64-bit Alpha variants of OpenVMS. An RMS file may be indexed 
(ISAM-like) with multiple indexed fields, relative with uniquely numbered records, 
or sequential like an ordinary text file. The major concept for RMS files is that access 
is possible to individual records in the file, i.e., to aggregated information units 
embodying the programmer’s data structures. 


The programming interface to RMS files is a large set of complex system-supplied 
macros that contain numerous embedded defaults and consistency cross-checks. Some 
of those macros build data structures of highly encoded attributes related to a file and 
the types of records it contains (such as keyed or indexed fields). The data structures are 
linked together in prescribed ways. Individual operations are specified using other 
macros that generate machine instructions, such as $create and $open for establishing 
and accessing files or $get and $put for obtaining records from and putting records 
into files. The RMS access methods are available to a programmer using assembly lan- 
guage or a high-level language like C, and an arrow in Figure 10.2 indicates this addi- 
tional type of access to be possible for OpenVMS on an Alpha. 


Besides these two layered system-supplied I/O systems (QIO and RMS), we 
will consider another layer that offers a choice of emulated Unix I/O system calls or 
the standard C functions for I/O. We show these at the same level in Figure 10.2, 
since both are internally implemented directly using the native calls (QIO and/or 
RMS). A program which we might compose in Alpha assembly language thus has at 
least four options for performing its own I/O (Figure 10.2). And, as we showed in 
earlier chapters, encapsulating and calling upon support routines from one or another 
high-level language may also be feasible in the well-standardized OpenVMS pro- 
gramming environment. 


A Simplifying Choice 


We have repeatedly cautioned that we intend this as a book centered upon Alpha archi- 
tecture, not a treatment of operating system features per se. Accordingly, we break here 
with our earlier progression of PDP-11 and VAX assembly language and architecture 
books. We will not discuss input and output using explicitly the QIO or RMS capabili- 
ties for the OpenVMS programming environment or using explicitly the Unix I/O sys- 
tem calls. Instead, we will cast our further discussion of input and output in this chapter 
in terms of selected aspects of the standard C library. 

In this regard, we need to alert those of our readers who may work in the OpenVMS 
programming environment that the entry point names of the C functions related to input 
and output all begin with decc$ and usually end with the standard names, but in certain 
instances contain other characters that would be hard to guess. The routines we have 
selected for discussion and possible use in sample programs are listed in Table 10.1. 


Table 10.1 Entry Point Names (OpenVMS) for Selected C Library Functions 

C Library Function    Access    Entry Point Name (OpenVMS) for IEEE Support 
fclose                file      decc$fclose 
fgets                 file      decc$fgets 
fopen                 file      decc$fopen 
fprintf               file      decc$txfprintf* 
fputs                 file      decc$fputs 
fscanf                file      decc$txfscanf* 
gets                  stdin     decc$gets 
perror                stderr    decc$perror 
printf                stdout    decc$txprintf* 
puts                  stdout    decc$puts 
scanf                 stdin     decc$txscanf* 

* Different entry point names provide support for VAX-compatible floating-point numbers 
(not discussed in this book). 


Keyboard and Display I/O 


A system-supplied standard C library provides functions for performing text-oriented 
I/O from the standard input (stdin), normally the keyboard, and to the standard out- 
put (stdout), normally the display. (Errors display on stderr.) 

Some of those functions may be provided in a “macro form” (see Kernighan) that 
is essentially C source code, located inside the <stdio.h> header file, which 
becomes part of the overall compilation input for a C program. On the Alpha, that is the 
case for getchar and putchar, which we therefore compiled on their own and 
linked to some of our sample assembly language programs from Chapter 6 onward in 
this book. 


Other functions may be already provided as object modules in system libraries that 
the linking phase (Unix) or the linker (OpenVMS) can automatically match by name to 
external calls that we express in an Alpha assembly language program unit. The functions 
that we intend to mention in this chapter for your possible use are those listed in Table 
10.1 for access to stdin or stdout (as illustrated in this section) and to text files (later 
in this same chapter). The calling conventions for the selected functions related to stdin 
and stdout, and for an error-printing function, are set forth in Table 10.2. 


Table 10.2 Calling Conventions for C Functions Related to stdin and stdout 

Function  Argument  Register(s)   Description 
puts      first     R16           Address of a null-terminated string. 
          returned  R0            Any non-negative value indicates success. 
gets      first     R16           Address of an adequately large storage area. 
          returned  R0            Passed-in value in R16 if successful, or zero if 
                                  any error occurred. 
printf    first     R16           Address of the format string. 
          other(s)  R17 or F17    Integer (Rn) or floating-point (Fn) quantities to 
                    up through    be formatted, or the address (Rn) for any string 
                    R21 or F21    quantities. 
                    (then stack) 
          returned  R0            Total number of bytes written (or various negative 
                                  codes indicating error conditions). 
scanf     first     R16           Address of the format string. 
          other(s)  R17 up        Address(es) of integer, floating-point, or string 
                    through R21   quantities interpreted according to the format 
                    (then stack)  string. 
          returned  R0            Number of input objects processed successfully, 
                                  or EOF if any error occurred. 
perror    first     R16           Address of an identifying string of your choice. 
          returned                Prints a system-dependent error message onto 
                                  stderr (meaningful only after an actual error). 

For this table, we have transformed the interfacing requirements for these functions 
from the language of C data types into a register-level perspective that matches the 
low-level calling conventions of Unix (using jsr instructions) and OpenVMS (using 
the $call macro). When a function returns a useful value (i.e., a value associated with 
the name of the function and its type, usually int) in register R0, we include a brief 
description of that value, but leave detailed error codes to the realm of man reference 
pages (Unix) and vendor-supplied manuals (OpenVMS). Such selectivity is in keeping 
with minimizing in this book the amount of material which lies closer to operating sys- 
tem concepts and details than to architecture concepts. 

Unix-like or C-like I/O can operate in several ways, and each approach has its 
range of applicability, or strengths and weaknesses if you prefer. Working with one byte 
at a time, as we have previously shown, is generally applicable but places every burden 
of interpretation upon the calling program: to test for end of file, to take notice of new- 
line characters, and to convert numeric data between external ASCII and internal binary 
representations. 
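
The burdens of byte-at-a-time input can be seen in a small C sketch. The function `read_decimal` is a hypothetical helper written for this illustration; it accepts only unsigned decimal digits and ignores overflow. It has to test for end of file, take notice of the newline, and convert ASCII digits to a binary value, all by itself.

```c
#include <stdio.h>

/* Read one unsigned decimal integer from the next input line, one byte
   at a time. Returns 1 if digits were found (value stored in *value),
   0 if the line held no leading digits, or EOF at end of file. */
int read_decimal(FILE *in, long *value)
{
    int c = getc(in);
    long n = 0;
    int digits = 0;

    while (c == ' ' || c == '\t')          /* skip leading blanks        */
        c = getc(in);
    if (c == EOF)
        return EOF;                        /* test for end of file       */
    while (c >= '0' && c <= '9') {         /* external ASCII to binary   */
        n = n * 10 + (c - '0');
        digits++;
        c = getc(in);
    }
    while (c != EOF && c != '\n')          /* discard the rest, noticing */
        c = getc(in);                      /* the newline character      */
    *value = n;
    return digits > 0 ? 1 : 0;
}
```

Every one of these steps is handled automatically by a formatted input function such as scanf.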


Unformatted Line I/O (gets and puts) 


Applications not only involving traditional “dumb terminals” (e.g., a Telnet session) but 
also involving text files are usually line-oriented to a significant degree. That is, the 
conceptual unit of input or output is the line of some n characters, not the single charac- 
ter. The standard C library meets this need with the gets and puts functions, which 
move an entire line of ASCII characters between “the system” or the external environ- 
ment and some particular storage region managed by the calling process. With gets 
the storage must be large enough to accommodate all the characters plus an ASCII null 
character which internally replaces the external newline character. With puts a 
newline character is appended when an ASCII string that is null-terminated in internal 
storage is moved to the external environment. 
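
The behavior of gets can be imitated with bounded storage, as in the following C sketch. It is not the library's own implementation: the wrapper `get_line` is a hypothetical name, and it is built on fgets because gets places no limit on the number of characters stored and was removed from the C standard in C11. As with gets, the external newline is replaced internally by a null.

```c
#include <stdio.h>
#include <string.h>

/* Read one line into buf, storing at most size-1 characters, and
   replace the newline with the terminating null as gets would.
   Returns buf, or NULL at end of file or on error. */
char *get_line(FILE *in, char *buf, int size)
{
    if (fgets(buf, size, in) == NULL)
        return NULL;
    buf[strcspn(buf, "\n")] = '\0';   /* newline becomes the null */
    return buf;
}
```

A line longer than the storage area is delivered in pieces by this wrapper rather than overrunning the buffer.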


The advantage of gets over getchar and puts over putchar is the elimina- 
tion of many instructions in an input loop, while retaining in most programs an essential 
processing loop to deal substantively with each line. That is, the program will concen- 
trate more on its own particular work, with less concern for certain I/O details. 


Formatted I/O (printf and scanf) 


Anyone who has written one-off programs appreciates the convenience of the formatted 
I/O methods provided by high-level languages, such as the print statement in BASIC, 
the writeln procedure in Pascal, or the printf function in C. Those provisions for 
formatted output are especially convenient because they can be used either somewhat 
loosely for rough-and-ready output or rather more carefully when details about appear- 
ance are important. 

On the Alpha, printf expects integer quantities to be passed by value in integer 
registers and floating-point quantities to be passed by value in floating-point registers. 
Sixth or subsequent quantities must be passed by value in the expected region on the 
stack. Output objects which are strings are passed by address. Note that the printf 
function must also be told explicitly when (where) to put out a newline character. On 
the Alpha, you do not generally see any actual output until a newline has been put out. 
One call to printf can print more than one line of output. We refer you to books or 
manuals on ANSI C for information about the format string used by printf. 
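
For comparison with the register-level rules above, here is how the same kinds of arguments look at the C level. The sketch uses snprintf, a sibling of printf that writes into a character array, so that the result can be inspected; the function `format_summary` and its format string are invented for this illustration. The string is passed by address, the integer and floating-point quantities are passed by value, and the newline appears only because the format string asks for it.

```c
#include <stdio.h>

/* Format one summary line into `out`. Returns the number of characters
   that the formatting produced, not counting the terminating null. */
int format_summary(char *out, size_t size, const char *name,
                   int words, double ratio)
{
    return snprintf(out, size, "%s: %d words, ratio %.2f\n",
                    name, words, ratio);
}
```

A call such as format_summary(buf, sizeof buf, "line", 3, 4.3333) would leave the text "line: 3 words, ratio 4.33" followed by a newline in buf.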

The inverse of printf is the scanf function for formatted input. Note that a 
single call to scanf may need to work past newline markers in order to satisfy the total 
number of objects designated by the format string. We refer you to books or manuals on 
ANSI C for information about the format string used by scanf. Note especially the 
ability of scanf to read individual “words” of text (including any immediately 
adjacent punctuation characters) if you use “%s” as the format string, because each such 
input field is terminated by a space, tab, or newline. 
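
The word-reading behavior of the %s conversion can be tried out with sscanf, which scans a string in memory instead of the standard input. In this sketch the function `count_words` is a hypothetical helper, the field width of 63 is an arbitrary limit chosen to protect the local buffer, and the %n conversion is used to learn how far each call advanced.

```c
#include <stdio.h>

/* Count the words in `text`, where a word is any run of characters
   ended by a space, tab, or newline (the rule used by %s). */
int count_words(const char *text)
{
    char word[64];
    int used = 0;
    int count = 0;

    /* %63s skips leading white space and then stores one word;
       %n records how many characters have been consumed so far. */
    while (sscanf(text, "%63s%n", word, &used) == 1) {
        count++;
        text += used;      /* resume just past the word */
    }
    return count;
}
```

Note that punctuation stays attached to its word, and that a newline is treated like any other separator.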


SCANTERM: Using C Standard I/O 


We presented in Chapter 6 a simple program called SCANTEXT which would scan 
through a stored string, byte by byte, looking for space characters as separators between 
words. That program, like many of our other rudimentary illustrations, had no provision 
for I/O other than the debugger. 

Now we can rework the SCANTEXT program and add input and output from the 
“terminal” (i.e., the keyboard and display) through the use of external calls to some of 
the standard C library functions whose interfacing requirements have been given in 
Table 10.2. The new SCANTERM program (Figure 10.3) prompts for a line of input 
using puts, obtains that input using gets, prints each word on a line by itself using 
puts, and prints a summary line containing two embedded numeric fields in decimal 
radix using printf. 


/*  SCANTERM    Demonstrate I/O for terminal (Unix)  */

/*  This program does lexical analysis by breaking an input
    line apart into separate words. These separate words are
    then written out, one per line. (Stop it with CTRL/C.)  */

STACK = 1                               # Quadwords needed
FRAME = ((STACK*8+8)/16)*16
IBUFL = 256                             # Input allowance
OBUFL = 20                              # Output allowance
SPACE = 32                              # ASCII for space
        .data
        .align  3                       # Quadword alignment
        .comm   IBUF,IBUFL              # Input line
        .align  3                       # Quadword alignment
        .comm   OBUF,OBUFL              # Output line
PRMT:   .asciiz "Enter a line of text"
TELL:   .ascii  "The line had %d words and %d characters."
        .byte   0xa,0                   # Newline, null
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #   mark the mandatory
main:                                   #   'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        .mask   0x04000000,-FRAME       # Saved only register R26
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  line
line:   lda     $16,PRMT                # R16 -> prompting phrase
        jsr     $26,puts                # Prompt for input
        ldgp    $gp,0($26)              # Restore global pointer
        blt     $0,stop0                # EOF or other error?
        lda     $16,IBUF                # R16 -> input buffer
        jsr     $26,gets                # Get some input
        ldgp    $gp,0($26)              # Restore global pointer
        beq     $0,stop1                # Error?
        bis     $31,$31,$9              # R9 counts characters
        bis     $31,$31,$10             # R10 counts words
        lda     $11,IBUF-1              # Pre-index to input area
word:   lda     $12,OBUF-1              # Pre-index to output area
char:   lda     $11,1($11)              # R11 -> input location
        lda     $12,1($12)              # R12 -> output location
        ldq_u   $1,($11)                # Quadword containing char
        addq    $9,1,$9                 # Count one character
        extbl   $1,$11,$1               # R1 = character code
        cmpeq   $1,0,$13                # R13 = 1 if end of line
        cmovne  $13,SPACE,$1            # Use space at end of line
        subq    $1,SPACE,$22            # R22 = 0 if found a space,
        cmoveq  $22,$22,$1              #   then convert space to null
byte:   ldq_u   $23,($12)               # Get quad from byte location
        mskbl   $23,$12,$23             # Make a hole in the quad
        insbl   $1,$12,$1               # Shift the byte
        bis     $1,$23,$1               # Assemble modified quad
        stq_u   $1,($12)                # Store the byte
        bne     $22,char                # Look for more chars?
        addq    $10,1,$10               # Count one word
        lda     $16,OBUF                # R16 -> output buffer
        jsr     $26,puts                # Print the word
        ldgp    $gp,0($26)              # Restore global pointer
        blt     $0,stop0                # Error?
        beq     $13,word                # Look for more words?
        subq    $9,1,$9                 # Correct for end of line
        lda     $16,TELL                # R16 -> control format
        mov     $10,$17                 # R17 = number of words
        mov     $9,$18                  # R18 = number of characters
        jsr     $26,printf              # Print the summary
        ldgp    $gp,0($26)              # Restore global pointer
        blt     $0,stop0                # Error?
        br      $31,line                # Look for another line?
stop0:                                  # Output error
stop1:                                  # EOF or input error
done:   mov     0,$0                    # Signal all is normal
        ldq     $26,($sp)               # Restore exit address
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 10.3 SCANTERM: Showing calls to C functions for standard I/O 


We encourage you to study SCANTERM carefully, because it employs several 
techniques that are quite generally useful. Most of the program comprises one big loop 
extending from line up to done. Many traditional line-oriented utility programs that 
contain an apparent infinite loop like the big loop in SCANTERM depend on exception 
handling by the operating environment for their means of being stopped. For both Unix 
and OpenVMS, a program will be forced to stop if control-C is entered from the 
keyboard when the program needs input, e.g., when gets is called in SCANTERM. 

In the Unix variant of SCANTERM, the global pointer must be reset immediately 
after return from a routine called using jsr with register R26. Only then should any 
returned values be accessed or be tested for errors, as here with the conditional branches 
to labels stop0 and stop1. In the interest of keeping SCANTERM simple, we have 
not developed any error reporting or recovery coding at those symbolic locations. 
Therefore this program will just exit if any error does occur. Real-world applications 
frequently must contain very extensive error trapping. 


Inside the big loop, SCANTERM has two more nested loops. One loop begins at 
word and is traversed once per word encountered in the input text stream. Another loop 
begins at char and is traversed once per byte encountered in the input text stream. 
Loop control is a design issue, and we have here chosen a method which you may deem 
somewhat tricky, but it typifies what is readily possible when one is programming at the 
assembly language level. 


In our situation, we need to treat the null which indicates end of line in internal 
string storage as a word terminator, i.e., to be equivalent to a space character following a 
word. That is, there are the two cases to be recognized: a word followed by a space, and 
a word occurring at the end of a line. Then these cases need to be logically joined in a 
common treatment for output. What we do, therefore, is convert the null to a space with 
the cmovne instruction, but we also remember that we saw the null by means of the true 
condition set in register R13 by the previous cmpeq instruction. These instructions have 
the effect of making the word at the end of the line subsequently seem just like any previ- 
ous word. The end of line condition is later tested after each word has been displayed, 
when it is time to branch back to word unless this was the last word on a line. 


We also need to convert the now-universal space character at the end of each word 
into the null character required at the end of any string to be displayed using puts. 
This is accomplished with the cmoveq instruction that moves a scratch copy of a char- 
acter whose ASCII code has been diminished by the ASCII code for a space only if that 
result was a null, i.e., the character value in register R1 had in fact been a space. 
Remember that Alpha cmov instructions are very efficient because they simplify cod- 
ing which would otherwise have to include one more branch instruction. 
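
The effect of the compare, conditional move, and subtract instructions described above can be modeled in C. This is a sketch of the data flow only: the structure `scan_step` and the function `classify` are names invented here, the local variables r1, r13, and r22 stand in for the registers of the same numbers, and each `if` statement plays the part of one conditional move.

```c
enum { SPACE = 32 };              /* ASCII code for a space */

struct scan_step {
    int ch;           /* byte to store: the character, or 0 ending a word */
    int end_of_line;  /* models R13: 1 if the input byte was the null     */
    int more_chars;   /* models R22: nonzero while the word continues     */
};

/* Classify one input byte the way the four instructions before the
   label byte do in SCANTERM. */
struct scan_step classify(int c)
{
    struct scan_step s;
    int r1 = c;
    int r13 = (r1 == 0);          /* cmpeq  $1,0,$13      */
    int r22;

    if (r13 != 0)                 /* cmovne $13,SPACE,$1  */
        r1 = SPACE;
    r22 = r1 - SPACE;             /* subq   $1,SPACE,$22  */
    if (r22 == 0)                 /* cmoveq $22,$22,$1    */
        r1 = r22;

    s.ch = r1;
    s.end_of_line = r13;
    s.more_chars = r22;
    return s;
}
```

An ordinary letter comes back unchanged with more_chars nonzero, a space comes back as a null with more_chars zero, and the terminating null does the same while also setting end_of_line.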


The control logic of this program is thus concentrated in the four instructions that 
precede the label byte. At byte, register R1 contains either a printing character or a 
null. Register R22 contains a nonzero value if the end of a word has not yet been seen, 
and looping back to char is appropriate. Register R13 contains the value zero if end of 
line has not yet been seen, and looping back to word is appropriate. Note that this latter 
register must be one of those saved across procedure calls (Table 7.2) because puts is 
called once per word. 

The byte load and store portions of SCANTERM follow directly from the principles 
and examples given previously in Chapter 6. Again in this program we use 
pre-indexing to make the pointers (registers R11 for input and R12 for output) work 
smoothly the first time and every time through the loops. The counters (registers R9 and 
R10) are incremented at appropriate points when each character or each word has been 
seen. (The number of characters has to be corrected because we are not considering the 
end of line to be an actual character.) 
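
The byte store sequence at the label byte can be modeled in C to show what each step contributes. The functions below are named after the Alpha instructions they imitate, but they are simplified models written for this illustration, assuming the usual little-endian byte numbering in which the low three bits of the address select the byte within the quadword.

```c
#include <stdint.h>

/* extbl: extract the byte selected by the low 3 bits of addr. */
uint64_t extbl(uint64_t quad, uint64_t addr)
{
    return (quad >> (8 * (addr & 7))) & 0xFF;
}

/* mskbl: clear ("make a hole at") the selected byte position. */
uint64_t mskbl(uint64_t quad, uint64_t addr)
{
    return quad & ~((uint64_t)0xFF << (8 * (addr & 7)));
}

/* insbl: shift the low byte of value into the selected position. */
uint64_t insbl(uint64_t value, uint64_t addr)
{
    return (value & 0xFF) << (8 * (addr & 7));
}

/* The mskbl, insbl, bis steps combined: the quadword that would be
   written back by stq_u after one byte has been replaced. */
uint64_t store_byte(uint64_t quad, uint64_t addr, uint64_t value)
{
    return mskbl(quad, addr) | insbl(value, addr);
}
```

Only the selected byte changes; the other seven bytes of the quadword are written back exactly as they were loaded.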

The calls to the C functions (puts, gets, printf) follow the principles that we 
developed in Chapter 7 for the Unix and OpenVMS calling standards, using the specific 
argument requirements given in Table 10.2. 


Sample input:   one two three 
Sample output:  one 
                two 
                three 
                The line had 3 words and 13 characters. 


Note how easy it is to use the printf function from assembly language. 
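
For readers who want a high-level reference point, the word-splitting logic of SCANTERM can be sketched in C. This is not a translation of the assembly listing: the function `scan_line` is a hypothetical name, it takes a line that has already been read, and unlike SCANTERM it guards the 20-byte word buffer by truncating any longer word.

```c
#include <stdio.h>

/* Split `line` (already stripped of its newline) into words, writing
   one word per line to `out`. Returns the number of words and stores
   the number of characters, not counting the end of line, in *chars. */
int scan_line(const char *line, FILE *out, int *chars)
{
    char word[20];                 /* plays the role of OBUF */
    size_t w = 0;
    int n_words = 0;
    int n_chars = 0;

    for (;;) {
        char c = line[n_chars];
        if (c != ' ' && c != '\0') {
            if (w < sizeof word - 1)
                word[w++] = c;     /* copy one byte of the word */
            n_chars++;
            continue;
        }
        word[w] = '\0';            /* space or end of line ends a word */
        fprintf(out, "%s\n", word);
        n_words++;
        w = 0;
        if (c == '\0')
            break;                 /* end of line: no more words */
        n_chars++;                 /* the space itself is counted */
    }
    *chars = n_chars;
    return n_words;
}
```

Applied to the sample input one two three, the sketch writes three lines and reports 3 words and 13 characters, in agreement with the sample output.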


SORTSTR: Sorting Strings from stdin 


A major task to which computers are applied is the sorting of numeric or string data. In 
the study of computer science, one encounters numerous algorithms for sorting because 
each has its strengths and weaknesses. Here we will present an implementation of one 
of the simplest such algorithms, the bubble sort. Our purpose is not to engage in any 
discussion about the relative merits of various approaches to sorting, but rather to show 
how to combine some of the elements of the Alpha instruction set and apply them to 
this very important task. 

The SORTSTR program (Figure 10.4), which sorts strings alphabetically accord- 
ing to the collating sequence of the ASCII character codes, resembles the SCANTERM 
program (Figure 10.3) in its input and output aspects, but it also illustrates some of the 
principles of byte-oriented instructions from Chapter 6. Later in this chapter, we present 
a companion program in which numeric data are similarly sorted. 


/*  SORTSTR     Bubble sort strings (Unix)  */

/*  This program will read 100 or fewer strings (each less than
    80 characters) from the standard input, sort them using the
    bubble sort algorithm, and display the sorted list.  */

SIZE  = 80                              # Size of individual string
SIZEQ = ((SIZE+7)/8)*8                  # Round up to quadwords
CASES = 100                             # Array length
STACK = 1                               # Quadwords needed
FRAME = ((STACK*8+8)/16)*16
        .data
        .align  3                       # Quadword alignment
        .comm   ARRAY,CASES*SIZEQ       # Room for 100 strings
PRMT:   .asciiz "Enter strings (null line to end)"
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  main                    # These three lines
        .ent    main                    #   mark the mandatory
main:                                   #   'main' program entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        .mask   0x04000000,-FRAME       # Saved only register R26
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        .globl  start
start:  lda     $16,PRMT                # R16 -> prompting phrase
        jsr     $26,puts                # Prompt for input
        ldgp    $gp,0($26)              # Restore global pointer
        blt     $0,stop0                # Error?
        bis     $31,$31,$9              # R9 counts lines (strings)
        lda     $11,ARRAY-SIZEQ         # Pre-index to storage area
line:   lda     $11,SIZEQ($11)          # Advance for next element
        mov     $11,$16                 # R16 -> element in ARRAY
        jsr     $26,gets                # Get some input
        ldgp    $gp,0($26)              # Restore global pointer
        beq     $0,stop1                # EOF or other error?
        ldq_u   $1,($11)                # Quadword with first byte
        extbl   $1,$11,$1               # R1 = character code
        beq     $1,sort                 # Null signals end of input
        addq    $9,1,$9                 # Count one line
        lda     $21,SIZEQ-1($11)        # R21 -> last byte
        ldq_u   $1,($21)                # Quadword with last byte
        mskbl   $1,$21,$1               # Force it to be a null
        stq_u   $1,($21)                # Rewrite the quadword
        cmpult  $9,CASES,$0             # If storage is ok,
        blbs    $0,line                 #   then look for more
sort:   subq    $9,1,$24                # R24 counts down outer loop
o_loop: mov     $24,$23                 # R23 counts down inner loop
        lda     $11,ARRAY-1             # R11 -> 1st string
        lda     $12,ARRAY+SIZEQ-1       # R12 -> 2nd string
i_loop: mov     $11,$21                 # R21 -> 1st string
        mov     $12,$22                 # R22 -> 2nd string
look:   lda     $21,1($21)              # R21 -> byte in 1st string
        lda     $22,1($22)              # R22 -> byte in 2nd string
        ldq_u   $1,($21)                # R1 = quad from 1st string
        ldq_u   $2,($22)                # R2 = quad from 2nd string
        extbl   $1,$21,$16              # R16 = byte from 1st string
        beq     $16,adjust              # Null means 1,2 order is ok
        extbl   $2,$22,$17              # R17 = byte from 2nd string
        cmpeq   $16,$17,$0              # If equal at this position,
        blbs    $0,look                 #   go look at next bytes
why:    cmpult  $16,$17,$0              # If 1,2 order is ok,
        blbs    $0,adjust               #   go look at next strings
        subq    $11,7,$21               # Pre-index the addresses
        subq    $12,7,$22               #   for quadword access now
swap:   lda     $21,8($21)              # Advance by one quadword
        lda     $22,8($22)              #   in both storage areas
        ldq     $1,($21)                # Get a quadword
        ldq     $2,($22)                #   from each string
        stq     $1,($22)                # Swap the storage locations
        stq     $2,($21)                #   of these quadwords
        cmpbge  $31,$1,$0               # If first has no null,
        beq     $0,swap                 #   then swap is not done
        cmpbge  $31,$2,$0               # If second has no null,
        beq     $0,swap                 #   then swap is not done
adjust: lda     $11,SIZEQ($11)          # Advance to next as 1st
        lda     $12,SIZEQ($12)          # Advance to next as 2nd
        subq    $23,1,$23               # Count down...
        bgt     $23,i_loop              #   ...for inner loop
        subq    $24,1,$24               # Count down...
        bgt     $24,o_loop              #   ...for outer loop
        lda     $11,ARRAY-SIZEQ         # Pre-index to storage area
p_loop: lda     $11,SIZEQ($11)          # Advance for next element
        mov     $11,$16                 # R16 -> element in ARRAY
        jsr     $26,puts                # Print one line
        ldgp    $gp,0($26)              # Restore global pointer
        blt     $0,stop0                # Output error?
        subq    $9,1,$9                 # Count one line done
        bgt     $9,p_loop               #   and repeat if more
stop0:                                  # Output error
stop1:                                  # EOF or input error
done:   mov     0,$0                    # Signal all is normal
        ldq     $26,($sp)               # Restore exit address
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to Unix environment
        .end    main                    # Mark end of procedure


Figure 10.4 SORTSTR: Bubble sort for strings 

Each string is allotted SIZE bytes (including the null appended by gets). Caution: 
with gets the system software offers no protection against data overrun. 
Although we do not test for overrun, the input loop starting at line does ensure that 
each string has a null at the end, even if that might truncate the string. The program does 
not warn when ARRAY is filled, but just proceeds to the sorting phase with the maximal 
number of strings. 

The pattern of register use in SORTSTR is somewhat deliberate. Register R11 is 
used as a semi-permanent pointer to one string element in ARRAY, although different 
amounts of pre-indexing are used at various times. Register R21 is used as a less-per- 
manent pointer initialized relative to register R11. Byte or quadword data loaded and 
stored using register R11 or R21 are held in register R1. When two strings are being 
considered at the same time, registers R2, R12, and R22 are used analogously for the 
second string. Also note that a program has free use of registers R16-R21 (and others) 
in a section where it calls no other procedures; here registers R16 and R17 are used in a 
compute-bound section of the program where no I/O functions are called. 

The sorting section starting at sort consists of the outer and inner loops required for 
any implementation of the bubble sort algorithm. The outer loop starts at o_loop and 
is controlled by register R24. The inner loop starts at i_loop and is controlled by 
register R23. These loops envelop two other non-overlapping loops, one to make 
comparisons and the other to perform interchanges of strings. 

The loop starting at look inspects two adjacently stored strings, byte by byte, to 
determine whether they are already in proper order or must be interchanged. 

The loop starting at swap interchanges two strings, eight bytes at a time. This 
technique, which requires quadword alignment and a storage allotment in quadword 
multiples, is faster than moving data by the byte, i.e., requires fewer instruction cycles. 

In classic textbook explanations of the bubble sort algorithm, a temporary storage 
element equivalent in capacity to the entities being sorted is shown in operation: 


TEMP := A 
A := B 
B := TEMP 


Here it would be a waste of effort to move entire strings twice. Because we have point- 
ers to them (registers R21, R22) and quadword-length extracts from them (registers R1, 
R2), simply storing each extract using the other pointer accomplishes the swapping. 
Another interpretation is that two temporary storage elements are being used (registers 
R1, R2). 
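
The register-based exchange can be sketched in C as follows. This is only an illustration under stated assumptions, not the SORTSTR code itself: the name swap_slots and the 16-byte SLOT_BYTES value are hypothetical, and the sketch always exchanges the whole slot, whereas SORTSTR stops once both strings have shown a null.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot size: a multiple of 8, in the spirit of SIZEQ. */
#define SLOT_BYTES 16

/* Exchange two equally sized, quadword-aligned slots 8 bytes at a time.
   The locals r1 and r2 play the role of registers R1 and R2; no
   slot-sized TEMP area is needed. */
static void swap_slots(uint64_t *first, uint64_t *second)
{
    for (size_t i = 0; i < SLOT_BYTES / 8; i++) {
        uint64_t r1 = first[i];    /* quad from 1st string */
        uint64_t r2 = second[i];   /* quad from 2nd string */
        second[i] = r1;            /* store each extract ...    */
        first[i]  = r2;            /* ... via the other pointer */
    }
}
```

Each quadword is loaded once and stored once, which is the saving over the three-assignment TEMP scheme.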

Control of the various loop structures in this program has been designed with the 
fewest branch instructions and the smoothest logical flow. At adjust the pointers into 
ARRAY are advanced by the quadword-rounded size (SIZEQ) allotted to each string. 
Note the use of cmpbge instructions to see whether either string contains any null 
characters. Remember also that in a bubble sort the inner loop is initialized for one fewer 
iteration than the remaining scope for the outer loop. 
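
For readers more comfortable with C, the null test can be modeled as below. This is a sketch of the effect of a cmpbge whose first operand is R31 (always zero), which is how we read the listing; the function name zero_byte_mask is hypothetical.

```c
#include <stdint.h>

/* Model of cmpbge with zero as the first operand: bit i of the result
   is set when 0 >= byte i of q (unsigned), i.e., when byte i is null. */
static unsigned zero_byte_mask(uint64_t q)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++) {
        if (((q >> (8 * i)) & 0xff) == 0)
            mask |= 1u << i;
    }
    return mask;
}
```

A nonzero mask means the quadword holds at least one null, so the end of that string has been reached. The Alpha does all eight byte comparisons in a single instruction, whereas this model loops.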

The strings are always manipulated in ways which preserve the null at the end, 
and they begin at regularly spaced addresses. Thus it is very easy to print them out, after 
the sorting phase, using the puts function in the loop starting at p_loop. 


Sample input: two 
Two 
TWO 
three 
ten 
thirteen 
twenty 
one hundred 


Sample output: TWO 
Two 
one hundred 
ten 
thirteen 
three 
twenty 
two 


Note that the ASCII collating code does not result in the type of alphabetizing which 
merges upper-case and lower-case entries (as is done for the index of this book). 
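
The same ordering can be reproduced in C with strcmp, which compares the raw character codes. The sketch below is an assumption-laden illustration: it uses the library qsort rather than a bubble sort, and the names by_ascii and sort_ascii are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback: order two strings by their raw character codes. */
static int by_ascii(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort an array of string pointers into ASCII collating order. */
static void sort_ascii(const char **items, size_t n)
{
    qsort(items, n, sizeof items[0], by_ascii);
}
```

Because every upper-case letter has a smaller code than every lower-case letter, TWO and Two sort ahead of all the lower-case entries, as in the sample output.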


Text File I/O 


A system-supplied standard C library provides functions for performing text-oriented 
I/O not only from the standard input and output, but also from and to text files stored 
on disk. (We will not explicitly treat I/O from or to other storage media, such as mag- 
netic tape.) 

Most of the file-related functions are provided as object modules in system librar- 
ies which the linking phase (Unix) or the linker (OpenVMS) can automatically match 
by name to external calls that we express in an Alpha assembly language program unit. 
The calling conventions for the selected functions related to text files are set forth in 
Table 10.3. 
Table 10.3 Calling Conventions for C Functions Related to Text Files 


Function  Argument  Register(s)        Description 
fopen     first     R16                Address of the file specification (character string). 
          second    R17                Address of the mode (character string). 
          returned  R0                 File pointer (address of block of information 
                                       about the file) or zero if any error occurred. 
fclose    first     R16                File pointer (address of block of information 
                                       about the file). 
          returned  R0                 Zero indicating success or EOF if any error 
                                       occurred. 
fputs     first     R16                Address of a null-terminated string. 
          second    R17                File pointer (address of block of information 
                                       about the file). 
          returned  R0                 Zero indicating success or EOF if any error 
                                       occurred. 
fgets     first     R16                Address of an adequately large storage area. 
          second    R17                Size of the storage area being provided. 
          third     R18                File pointer (address of block of information 
                                       about the file). 
          returned  R0                 Passed-in value in R16 if successful, or zero if 
                                       any error occurred or end of file was detected. 
fprintf   first     R16                File pointer (address of block of information 
                                       about the file). 
          second    R17                Address of the format string. 
          other(s)  R18 or F18 up      Integer (Rn) or floating-point (Fn) quantities to 
                    through R21 or     be formatted, or the address (Rn) for any string 
                    F21 (then stack)   quantities. 
          returned  R0                 Total number of bytes written (or various negative 
                                       codes indicating error conditions). 
fscanf    first     R16                File pointer (address of block of information 
                                       about the file). 
          second    R17                Address of the format string. 
          other(s)  R18 up through     Address(es) of integer, floating-point, or string 
                    R21 (then stack)   quantities interpreted according to the format string. 
          returned  R0                 Number of input objects processed successfully, 
                                       or EOF if any error occurred. 







For this table, we have transformed the interfacing requirements for these functions 
from the language of C data types into a register-level perspective that matches the 
low-level calling conventions of Unix (using jsr instructions) and OpenVMS (using 
the $call macro). When a function returns a useful value (i.e., a value associated with 
the name of the function and its type, usually int) in register R0, we include a brief 
description of that value, but leave detailed error codes to the realm of man reference 
pages (Unix) and vendor-supplied manuals (OpenVMS). Such selectivity is in keeping 
with minimizing in this book the amount of material which lies closer to operating 
system concepts and details than to architecture concepts. 


Directory-Level Access (fopen and fclose) 


With standard input and output, no special preparations are necessary because the 
environment in which a process runs already has stdin and stdout predefined and 
ready to accept any I/O by a program. With file I/O, an obvious preliminary require- 
ment involves specifying which particular file, out of the great many potentially acces- 
sible, is to be designated for I/O activity. In many programming languages, this 
designation is called opening a file, just as it is with a word processor application on a 
personal computer. Since resources are consumed both for the process and within the 
operating system to establish and maintain certain data structures related to file access, 
there is also the concept of closing a file when a program or process has no further I/O 
activity for a particular file, again fully analogous to finishing up with a word processor 
application. 


All of the file-related I/O functions which we have elected to include in Table 
10.3 involve a concept called a file pointer, which is the address of a data structure 
provided by the operating environment for keeping track of certain information about 
a file while it is open for I/O by a process. None of our sample programs will follow 
that read-only pointer to probe the data structure to which it points. The internal 
composition of that data structure may be implementation-dependent, i.e., different for 
each operating system. 


The fopen function needs two pieces of information as its input arguments, both 
of which are supplied in the form of address pointers leading to null-terminated charac- 
ter strings. The first of those arguments, the name of the file, must be specified accord- 
ing to the conventions of the particular operating environment. If the system-supplied 
defaults are not suitable, the file name may need to include device and/or directory 
specificity. The permitted ASCII characters and overall length for a file name are also 
system-dependent. 


The second argument for fopen indicates how the file is to be accessed. This 
access mode specification has (at least) three values for any implementation of the 
standard C library: 


r Access is established to read from the file. 
w Access is established to write into the file. 


a Access is established to append onto the previous end of the file. 


For the r mode, the file must already exist. For the w and a modes, a new file may be 
created. For the w mode, a Unix system will discard any older file of the same name, 
while an OpenVMS system will create the new file with a higher version number. 

The file pointer address returned in register R0 should be retained in a local variable 
on the stack or in some other general register whose contents are preserved across 
procedure calls (Table 7.2). All of the other file-access functions (Table 10.3) require 
this file pointer as an input argument to convey which one, among perhaps several open 
files, is to be accessed by calling the I/O function. In general, the file pointer will be an 
address value within a region of address space mapped by the memory access and 
protection subsystem to permit shared access by the process and also by the operating 
system modules where I/O is physically handled. 

The fclose function requires only the file pointer as an input argument. When 
the system closes a text file that had been opened in w or a mode, any intermediary 
hidden buffers are flushed and all pending physical write operations are permitted to 
complete before the close operation is considered to have been accomplished. 

A program may simultaneously access numerous open files. The assembly lan- 
guage programmer or the compiler must manage all of the respective file pointers. If 
there are more than a few, a logical array might be appropriate to hold those pointers. 
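
In C terms, the open and close protocol looks like the following sketch. It is a minimal illustration only: the function name touch_and_check is hypothetical, and a real program would report which step failed rather than returning a single code.

```c
#include <stdio.h>

/* Create (or truncate) a file, close it, then reopen it for reading.
   Returns 0 on success, -1 if any step fails. */
static int touch_and_check(const char *name)
{
    FILE *fp = fopen(name, "w");         /* "w": write access, may create */
    if (fp == NULL)                      /* zero file pointer: open failed */
        return -1;
    if (fclose(fp) == EOF)               /* EOF return: close failed */
        return -1;

    fp = fopen(name, "r");               /* "r": the file must already exist */
    if (fp == NULL)
        return -1;
    return fclose(fp) == EOF ? -1 : 0;
}
```

The tests against NULL and EOF correspond to the beq and blt checks on register R0 that follow each jsr in the assembly language programs.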


Unformatted Line I/O (fgets and fputs) 


Text files are usually line-oriented to a significant degree. That is, the conceptual unit of 
input or output is the line of some n characters, not the single character. The standard C 
library meets this need with the fgets and fputs functions, which move an entire 
line of ASCII characters between a file in the external environment and some particular 
storage region managed by the calling process. 

With the fgets function, the second argument conveys the size of the storage 
area provided by the calling program. Input will terminate at the first newline encoun- 
tered, or will be stopped short to permit a null to be put in the last byte location. Unlike 
the gets function, fgets brings the newline into the internal storage. 
We have here used the Unix nomenclature, that is, newline character (in the singular). 
Various computer operating systems actually use their own special conventions for 
representing newline in files (Table 10.4). When the C-like fgets and fputs functions 
are implemented on any system, those functions always use a null terminator at 
the end of strings in internal storage (i.e., in memory). 


Table 10.4 Line Terminators in Text Files 


Operating          ASCII         Hexadecimal 
System             Symbol(s)*    Codes         Remarks 

DOS; Windows NT    <cr><lf>      0d 0a 

Mac OS             <cr>          0d            Macintosh files also have a 
                                               “resource fork” containing things 
                                               like icons. 

OpenVMS            None          None          RMS sequential files have a 2-byte 
                                               length count prefixed to each line. 

OpenVMS            <cr><lf>      0d 0a         Alternative “stream format” files. 

Unix               <lf>          0a 


* <cr> means the carriage return character; <lf> means the line feed character. 


Unlike the puts function, fputs does not automatically append any newline 
character to its output. The correlated features of fgets and fputs have the result 
that a program loop repetitively using fgets to obtain data from one file and using 
fputs to move the same data into a new file would produce an exact copy, regardless 
of line lengths in the data being copied. 
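
A minimal C sketch of such a copy loop follows. The function name copy_lines and the deliberately small 16-byte buffer are our assumptions, chosen to show that a line longer than the buffer is simply moved in several pieces.

```c
#include <stdio.h>

/* Copy a text stream line by line. Returns 0 on success, -1 on error. */
static int copy_lines(FILE *in, FILE *out)
{
    char buf[16];   /* small on purpose: long lines arrive in pieces */

    while (fgets(buf, sizeof buf, in) != NULL) {
        /* fgets kept any newline and fputs adds none, so the
           output matches the input byte for byte. */
        if (fputs(buf, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Had the loop paired fgets with puts, or gets with fputs, newlines would have been doubled or lost.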


Formatted I/O (fprintf and fscanf) 


The fprintf and fscanf functions (Tables 10.1 and 10.3) work with format control 
strings to interpret information which they move between the external environment in 
text files and internal storage (memory). Their operation is quite analogous to that 
already described for the printf and scanf functions. The first argument (in register 
R16) for a call to fprintf or fscanf is the file pointer corresponding to the previ- 
ously opened text file into or from which I/O is to occur. 


On the Alpha, fprintf expects integer quantities to be passed by value in integer 
registers and floating-point quantities to be passed by value in floating-point registers. 
Fifth or subsequent quantities must be passed by value in the expected region on 
the stack. Output objects which are strings are passed by address. Note that the 
fprintf function must also be told explicitly when (where) to put out a newline 
character. One call to fprintf can print more than one line of output. We refer you to 
books or manuals on ANSI C for information about the format string used by 
fprintf. 

The inverse of fprintf is the fscanf function for formatted input. Note that a 
single call to fscanf may need to work past newline markers in order to satisfy the 
total number of objects designated by the format string. We refer you to books or 
manuals on ANSI C for information about the format string used by fscanf. Note 
especially the ability of fscanf to read individual “words” of text (including any 
immediately adjacent punctuation characters) if you use "%s" as the format string, 
because each such input field is terminated by a space, tab, or newline. 

Most programs using fscanf will perform formatted I/O on a whole file in a 
loop. Termination of that loop is eventually signaled by an end of file (EOF) condition 
returned in register R0. Each system defines some suitable negative value to represent 
the EOF condition. On Alpha systems, minus one is used. 


SCANFILE: Input and Output with Files 


As a first example of input and output with files, we continue the progression from 
SCANTEXT to SCANTERM to SCANFILE (Figure 10.5). For this (last) adaptation of 
the same general concept, we have designed a program which simply goes through an 
input text file and puts out the isolated words (with any immediately adjacent punctua- 
tion), one per line, into an output text file. The SCANFILE program also counts the 
number of words (or lines in the output file) and displays that value on the screen (or 
terminal). 


/*  SCANFILE  Demonstrate I/O for files (Unix)  */ 
/*  This program does lexical analysis by reading "words" 
    from a text file using fscanf. These separate words are 
    then written out, one per line, using fprintf.  */ 

STACK = 8                               # Quadwords needed 
FRAME = ((STACK*8+8)/16)*16 
BUFL  = 80                              # Allowance for a "word" 
FLEN  = 40                              # File name allowance 
        .data 
        .align  3                       # Avoid load exceptions 
        .comm   BUF,BUFL                # Input/output buffer 
        .comm   IPTR,8                  # Input file pointer 
        .comm   OPTR,8                  # Output file pointer 
        .comm   IFILE,FLEN              # Input file name 
        .comm   OFILE,FLEN              # Output file name 
IPRMT:  .asciiz "Input from? " 
OPRMT:  .asciiz "Output to? " 
IMODE:  .asciiz "r"                     # Read from input file 
OMODE:  .asciiz "w"                     # Create output file 
TELL:   .asciiz "The program has processed %d words.\n" 
PFORM:  .asciiz "%s"                    # Prompts are strings 
IFORM:  .asciiz "%s"                    # Scan for a "word" 
OFORM:  .asciiz "%s\n"                  # Print "word" then newline 
        .text                           # Section for program code 
        .align  4                       # Octaword alignment 
        .set    noreorder               # Disallow rearrangements 
        .globl  main                    # These three lines 
        .ent    main                    #   mark the mandatory 
main:                                   #   'main' program entry 
        ldgp    $gp,0($27)              # Load the global pointer 
        lda     $sp,-FRAME($sp)         # Allocate stack space 
        stq     $26,0($sp)              # Save our own exit address 
        .mask   0x04000000,-FRAME       # Saved only register R26 
        .frame  $sp,FRAME,$26,0         # Describe the stack frame 
        .prologue 1                     # Say that $gp is in use 
        .globl  start 
start:  lda     $16,PFORM               # R16 -> format string 
        lda     $17,IPRMT               # R17 -> prompting phrase 
        jsr     $26,printf              # Prompt for input 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop0                # Error? 
        lda     $16,IFILE               # R16 -> file name storage 
        jsr     $26,gets                # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop1                # EOF or other error? 
        lda     $16,IFILE               # R16 -> file name storage 
        lda     $17,IMODE               # R17 -> input mode 
        jsr     $26,fopen               # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop2                # Error? 
        stq     $0,IPTR                 # Save the file pointer 
        lda     $16,PFORM               # R16 -> format string 
        lda     $17,OPRMT               # R17 -> prompting phrase 
        jsr     $26,printf              # Prompt for output 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop0                # Error? 
        lda     $16,OFILE               # R16 -> file name storage 
        jsr     $26,gets                # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop1                # EOF or other error? 
        lda     $16,OFILE               # R16 -> file name storage 
        lda     $17,OMODE               # R17 -> output mode 
        jsr     $26,fopen               # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop4                # Error? 
        stq     $0,OPTR                 # Save the file pointer 
        bis     $31,$31,$10             # R10 counts words 
loop:   ldq     $16,IPTR                # R16 = input file pointer 
        lda     $17,IFORM               # R17 -> format string 
        lda     $18,BUF                 # R18 -> I/O buffer 
        jsr     $26,fscanf              # Read one "word" 
        ldgp    $gp,0($26)              # Restore global pointer 
        cmpeq   $0,1,$1                 # Expect one %s item 
        blbc    $1,eof                  # No - check for EOF next 
        ldq     $16,OPTR                # R16 = output file pointer 
        lda     $17,OFORM               # R17 -> format string 
        lda     $18,BUF                 # R18 -> I/O buffer 
        jsr     $26,fprintf             # Write one "word" 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop7                # Error? 
        addq    $10,1,$10               # Count one word 
        br      $31,loop                # Continue until EOF 
eof:    bge     $0,stop6                # Other than end of file 
        ldq     $16,IPTR                # R16 = input file pointer 
        jsr     $26,fclose              # Close input file 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop3                # Error? 
        ldq     $16,OPTR                # R16 = output file pointer 
        jsr     $26,fclose              # Close output file 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop5                # Error? 
        lda     $16,TELL                # R16 -> format string 
        mov     $10,$17                 # R17 = word count 
        jsr     $26,printf              # Print summary message 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop0                # Error? 
        br      $31,done                # All finished 
stop0:                                  # Terminal output error 
stop1:                                  # Terminal input error 
stop2:                                  # Problem opening IFILE 
stop3:                                  # Problem closing IFILE 
stop4:                                  # Problem opening OFILE 
stop5:                                  # Problem closing OFILE 
stop6:                                  # Problem getting input 
stop7:                                  # Problem doing output 
done:   mov     0,$0                    # Signal all is normal 
        ldq     $26,($sp)               # Restore exit address 
        lda     $sp,FRAME($sp)          # Restore stack level 
        ret     $31,($26),1             # Back to Unix environment 
        .end    main                    # Mark end of procedure 


Figure 10.5 SCANFILE: Using C-like input and output with files 
The top portion of SCANFILE begins with user prompting, similar to that in 
SCANTERM, in order to obtain file names. Those file names are tried with the fopen 
function in order to establish access to the input file and create the output file. For this 
program, as for SCANTERM, we have inserted appropriate checkpointing branch 
instructions which ensure that the program proceeds only if all necessary prior condi- 
tions are met. Those branch instructions just lead to various stop labels without any 
corresponding routines for displaying diagnostic messages or taking corrective action. 
We do not wish to bog down in system-specific details about error codes, as we have 
stated previously. 


The functional core of SCANFILE consists of a concise loop starting at the label 
loop. The fscanf function with the control format "%s" reads one “word” from the 
input file, i.e., a string of characters up to a terminator (space, tab, or newline) that 
possibly includes some immediately adjacent punctuation characters. The terminating 
character is not stored as part of the string, which is instead terminated with a null in 
memory. Each such string is then sent into the output file by the fprintf function 
using the control format "%s\n" (in C style). The OpenVMS assembler requires us to 
construct this using the ASCII code for a linefeed (Table 2.3). 


Exit from this program loop occurs when a last attempted call to fscanf returns 
the EOF condition (negative value) in register R0. When the branch to eof is taken, the 
program closes both files and then uses printf to display a summary message with 
the number of “words” processed. 
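
For comparison, the core of SCANFILE might be written in C roughly as follows. This is a sketch, not a translation checked against the assembly language version: the function name scan_words is hypothetical, and the field width in "%79s" is our own addition to keep a long word within an 80-byte buffer.

```c
#include <stdio.h>

/* Copy each whitespace-delimited "word" from in to out, one per line.
   Returns the number of words written, or -1 on an I/O problem. */
static long scan_words(FILE *in, FILE *out)
{
    char buf[80];                  /* same allowance as BUFL */
    long count = 0;                /* plays the role of R10 */
    int got;

    while ((got = fscanf(in, "%79s", buf)) == 1) {  /* expect one %s item */
        if (fprintf(out, "%s\n", buf) < 0)          /* negative: output error */
            return -1;
        count++;
    }
    return got == EOF ? count : -1;   /* anything but EOF is a problem */
}
```

The test for a return value of exactly one, followed by a separate test for EOF, mirrors the cmpeq and blbc pair in the loop and the bge at the label eof.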


Sample dialog: Input from? lincoln.txt 
               Output to? abc.txt 
               The program has processed 274 words. 

Clearly the “words” could be scrutinized more carefully by suitable elaboration using 
more instructions within the loop. (The file lincoln.txt contains a typescript 
version of the Gettysburg address.) 


We would also have our readers appreciate that the brevity of the loop in 
SCANFILE as compared to that in SCANTERM does not mean SCANFILE is 
faster than SCANTERM. The internal coding in the system-supplied fscanf 
function has to perform character-by-character testing analogous to what we used 
in SCANTERM. Essential work cannot be avoided, but it can be encapsulated and 
made callable. That way, a routine is probably more useful but possibly less effi- 
cient than costly hand craftsmanship. 
SORTINT: Sorting Integers from a File 


As a second example of working with files, we present a companion to the SORTSTR 
program, which performed a bubble sort on strings entered interactively. Here we dis- 
cuss the SORTINT program (Figure 10.6), which brings integer quantities from an 
ASCII text file into internal binary storage as quadwords, sorts them using the bubble 
sort algorithm, and writes the ordered list of signed integers into a new ASCII text file. 





/*  SORTINT  Bubble sort integers (Unix)  */ 
/*  This program will read 100 or fewer integers from a 
    text file, sort them using the bubble sort algorithm, 
    and write the sorted list into a text file.  */ 

STACK = 1                               # Quadwords needed 
FRAME = ((STACK*8+8)/16)*16 
FLEN  = 40                              # File name allowance 
CASES = 100                             # Array length 
NUML  = 8                               # Integer size 
        .data 
        .align  3                       # Avoid load exceptions 
FULL:   .quad   -1                      # EOF on this system 
        .comm   IPTR,8                  # Input file pointer 
        .comm   OPTR,8                  # Output file pointer 
        .comm   IFILE,FLEN              # Input file name 
        .comm   OFILE,FLEN              # Output file name 
        .comm   ARRAY,CASES*NUML        # Room for 100 integers 
IPRMT:  .asciiz "Input from? " 
OPRMT:  .asciiz "Output to? " 
IMODE:  .asciiz "r"                     # Read from input file 
OMODE:  .asciiz "w"                     # Create output file 
TELL:   .asciiz "The program has processed %d numbers.\n" 
PFORM:  .asciiz "%s"                    # Prompts are strings 
IFORM:  .asciiz "%Ld"                   # Scan for 64-bit number 
OFORM:  .asciiz "%18Ld\n"               # Print number then newline 
ERROR:  .asciiz "Error" 
        .text                           # Section for program code 
        .align  4                       # Octaword alignment 
        .set    noreorder               # Disallow rearrangements 
        .globl  main                    # These three lines 
        .ent    main                    #   mark the mandatory 
main:                                   #   'main' program entry 
        ldgp    $gp,0($27)              # Load the global pointer 
        lda     $sp,-FRAME($sp)         # Allocate stack space 
        stq     $26,0($sp)              # Save our own exit address 
        .mask   0x04000000,-FRAME       # Saved only register R26 
        .frame  $sp,FRAME,$26,0         # Describe the stack frame 
        .prologue 1                     # Say that $gp is in use 
        .globl  start 
start:  lda     $16,PFORM               # R16 -> format string 
        lda     $17,IPRMT               # R17 -> prompting phrase 
        jsr     $26,printf              # Prompt for input 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop0                # Error? 
        lda     $16,IFILE               # R16 -> file name storage 
        jsr     $26,gets                # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop1                # EOF or other error? 
        lda     $16,IFILE               # R16 -> file name storage 
        lda     $17,IMODE               # R17 -> input mode 
        jsr     $26,fopen               # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop2                # Error? 
        stq     $0,IPTR                 # Save the file pointer 
        lda     $16,PFORM               # R16 -> format string 
        lda     $17,OPRMT               # R17 -> prompting phrase 
        jsr     $26,printf              # Prompt for output 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop0                # Error? 
        lda     $16,OFILE               # R16 -> file name storage 
        jsr     $26,gets                # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop1                # EOF or other error? 
        lda     $16,OFILE               # R16 -> file name storage 
        lda     $17,OMODE               # R17 -> output mode 
        jsr     $26,fopen               # Get a name 
        ldgp    $gp,0($26)              # Restore global pointer 
        beq     $0,stop4                # Error? 
        stq     $0,OPTR                 # Save the file pointer 
        bis     $31,$31,$9              # R9 counts numbers 
        lda     $11,ARRAY-NUML          # R11 -> storage (pre-index) 
num:    lda     $11,NUML($11)           # Advance for next element 
        ldq     $16,IPTR                # R16 = input file pointer 
        lda     $17,IFORM               # R17 -> format string 
        mov     $11,$18                 # R18 -> proper slot in ARRAY 
        jsr     $26,fscanf              # Read one "word" 
        ldgp    $gp,0($26)              # Restore global pointer 
        cmpeq   $0,1,$1                 # Expect one %d item 
        blbc    $1,eof                  # No - check for EOF next 
        addq    $9,1,$9                 # Count one number 
        cmpule  $9,CASES,$0             # If storage is ok, 
        blbs    $0,num                  #   then look for more 
        ldq     $0,FULL                 # Fake an EOF 
eof:    bge     $0,stop6                # Other than end of file 
        ldq     $16,IPTR                # R16 = input file pointer 
        jsr     $26,fclose              # Close input file 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop3                # Error? 
        mov     $9,$10                  # R10 = number of elements 
sort:   subq    $9,1,$24                # R24 counts down outer loop 
o_loop: mov     $24,$23                 # R23 counts down inner loop 
        lda     $11,ARRAY               # R11 -> 1st number 
        lda     $12,ARRAY+NUML          # R12 -> 2nd number 
i_loop: ldq     $1,($11)                # R1 = 1st number 
        ldq     $2,($12)                # R2 = 2nd number 
        cmple   $1,$2,$0                # If 1,2 order is ok, 
        blbs    $0,adjust               #   go look at next numbers 
swap:   stq     $1,($12)                # Swap the storage locations 
        stq     $2,($11)                #   of these quadwords 
adjust: lda     $11,NUML($11)           # Advance to next as 1st 
        lda     $12,NUML($12)           # Advance to next as 2nd 
        subq    $23,1,$23               # Count down... 
        bgt     $23,i_loop              #   ...for inner loop 
        subq    $24,1,$24               # Count down... 
        bgt     $24,o_loop              #   ...for outer loop 
        lda     $11,ARRAY-NUML          # Pre-index to storage area 
p_loop: lda     $11,NUML($11)           # Advance for next element 
        ldq     $16,OPTR                # R16 = output file pointer 
        lda     $17,OFORM               # R17 -> format string 
        ldq     $18,($11)               # R18 = element from ARRAY 
        jsr     $26,fprintf             # Print one element 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop7                # Output error? 
        subq    $9,1,$9                 # Count one line done 
        bgt     $9,p_loop               #   and repeat if more 
        ldq     $16,OPTR                # R16 = output file pointer 
        jsr     $26,fclose              # Close output file 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop5                # Error? 
        lda     $16,TELL                # R16 -> format string 
        mov     $10,$17                 # R17 = number count 
        jsr     $26,printf              # Print summary message 
        ldgp    $gp,0($26)              # Restore global pointer 
        blt     $0,stop0                # Error? 
        br      $31,done                # All finished 
stop0:                                  # Terminal output error 
stop1:                                  # Terminal input error 
stop2:                                  # Problem opening IFILE 
stop3:                                  # Problem closing IFILE 
stop4:                                  # Problem opening OFILE 
stop5:                                  # Problem closing OFILE 
stop6:                                  # Problem getting input 
stop7:                                  # Problem doing output 
        lda     $16,ERROR               # R16 -> dummy indicator 
        jsr     $26,perror              # Print system message 
        ldgp    $gp,0($26)              # Restore global pointer 
done:   mov     0,$0                    # Signal all is normal 
        ldq     $26,($sp)               # Restore exit address 
        lda     $sp,FRAME($sp)          # Restore stack level 
        ret     $31,($26),1             # Back to Unix environment 
        .end    main                    # Mark end of procedure 


Figure 10.6 SORTINT: Bubble sort for integer quantities 


This program combines features from the SCANFILE and SORTSTR programs. 
The pattern of calls to the I/O functions of C is very similar to the structure of the 
SCANFILE program, while the bubble sort portion is very similar to the corresponding 
portion of the SORTSTR program. 

The fscanf function (using %d in the control string) in the data input loop starting 
at num expects to be given an address of the location where the interpreted binary 
value is to be stored. In contrast, the fprintf function in the data output loop starting 
at p_loop expects to be given the actual value as an argument in a register or in a 
special area on the stack. Refer back to Table 10.3 or to books on the C language for further 
details about format strings and argument passing for these functions. 
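
The contrast between passing an address to fscanf and a value to fprintf is easy to see in a C rendering of the same job. The sketch below rests on several assumptions: the names bubble_sort and sort_file are hypothetical, it uses the standard C "%ld" conversion for long rather than the "%Ld" form in the listing, and it stops reading at the first item that is not a number without reporting why.

```c
#include <stdio.h>

#define CASES 100                 /* array length, as in SORTINT */

/* Bubble sort: the inner pass shrinks by one for each outer pass. */
static void bubble_sort(long *a, int n)
{
    for (int outer = n - 1; outer > 0; outer--) {
        for (int i = 0; i < outer; i++) {
            if (a[i] > a[i + 1]) {
                long r1 = a[i], r2 = a[i + 1];
                a[i + 1] = r1;    /* swap via two temporaries */
                a[i] = r2;
            }
        }
    }
}

/* Read up to CASES integers, sort them, and write them out.
   Returns the count written, or -1 on an output error. */
static int sort_file(FILE *in, FILE *out)
{
    long array[CASES];
    int n = 0;

    /* fscanf is given the ADDRESS of the slot to fill. */
    while (n < CASES && fscanf(in, "%ld", &array[n]) == 1)
        n++;

    bubble_sort(array, n);

    /* fprintf is given the VALUE to be formatted. */
    for (int i = 0; i < n; i++) {
        if (fprintf(out, "%18ld\n", array[i]) < 0)
            return -1;
    }
    return n;
}
```

In register terms, &array[n] corresponds to copying the slot address into R18 before calling fscanf, while array[i] corresponds to loading the quadword itself into R18 before calling fprintf.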


Sample dialog: > sortint {run the program} 
Input from? sortint.txt 
Output to? sorted.txt 
The program has processed 5 numbers. 
> cat sortint.txt {show the input file} 
2 
200 
-20 
123 
24 
> cat sorted.txt {show the output file} 
-20 
2 
24 
123 
200 


On the Alpha, where native integers are 64 bits in width, it is necessary to use L 
with numeric indicators (like %d) in format strings for I/O routines like fscanf and 
fprintf in the C support library when the quantities are anticipated to be negative 
(-20 here) or positive but spilling beyond 32 bits in width. 
Again with this program, we have suggested where some of the conditional tests 
for errors should be positioned. Just above done we have also shown how to use the 
perror function to display the error text, which may vary with the programming envi- 
ronment or natural language on a particular system. 

We have not written anything specific for further analysis, reporting, or recovery. 
Anomalous situations may arise both from a user’s actions and from the nature of the data 
encountered. From our common experiences with commercial software, we may imagine 
not only how difficult it is to write truly “bulletproof” applications, but also how very 
desirable it is for any error handling to be fully integrated in “look and feel” with the 
whole operating system. Therefore, we leave such concerns to other books that deal more 
directly with operating systems, user interface design, and software engineering. 


Binary Files 


Binary files are even more system-dependent than text files. In general, native I/O sys- 
tem calls for a given operating system must be used. Typically, a program reads or 
writes data in multiples of the physical block size on a peripheral device, working with 
buffer regions which are multiples of that same size set aside in memory by the pro- 
grammer. 

The data in a binary file may be intrinsically unstructured (i.e., just a byte stream)
or may have some internal structure. Examples of structured data include ISAM files 
(e.g., RMS indexed files on an OpenVMS system) or “direct-access” files containing 
arrays of floating-point data ordered according to the conventions of a specific high- 
level language (e.g., FORTRAN). 

Conventions are arbitrary choices. Not all programming languages, operating sys- 
tems, and processor architectures store binary data the same way. Are the information 
units for the constituent elements of an array stored row by row, or column by column? 
Does the system use big-endian or little-endian representations? 
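
Both questions can be made concrete with a short C sketch. All three function names here are hypothetical and chosen only for illustration. The first function reports the byte order of the machine on which it runs, and the other two give the linear position of an array element under the row-by-row convention (as in C) and the column-by-column convention (as in FORTRAN).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 on a little-endian machine and 0 on a big-endian one,
   by inspecting the lowest-addressed byte of a known value. */
int is_little_endian(void)
{
    unsigned int probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Linear position of element (row, col), stored row by row. */
size_t row_major_index(size_t row, size_t col, size_t rows, size_t cols)
{
    assert(row < rows && col < cols);
    return row * cols + col;
}

/* Linear position of the same element, stored column by column. */
size_t column_major_index(size_t row, size_t col, size_t rows, size_t cols)
{
    assert(row < rows && col < cols);
    return col * rows + row;
}
```

For a 3 by 4 array, element (1, 2) lies at position 6 under the first convention and at position 7 under the second, so a binary file written under one convention cannot be read correctly under the other without transposition.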


Summary 


In this chapter, we showed schematically the multiple software layers involved in input 
and output for typical operating system environments. Programmers at the assembly 
language level, and sometimes at higher levels of programming languages, can choose 
one or more of the layers of software routines for I/O implemented by a given program- 
ming environment and operating system. The higher layers may offer abstraction and 
some device-independence, but may not afford access to all of the actual capabilities 
provided by the lower layers of software and, ultimately, by the underlying physical
layers. (The writers of fast-paced PC games know this, and thus do exploit drilling down to
the lower levels of I/O implementation.)

We elected to discuss some of the I/O functions of the C language because those 
can be called in analogous ways from an Alpha assembly language program in either 
Unix or OpenVMS programming environments. We have shown the details of argument 
passing between such functions and four sample calling programs. These programs 
have illustrated, in various combinations, I/O from and to the keyboard and display, I/O 
from and to ASCII text files, a bubble sort for string data, and a bubble sort for integer 
quantities. We omitted system-dependent aspects of error detection and recovery as 
being too far afield from our central focus on Alpha architecture. 


EXERCISES


10.1 Use any available manuals and/or explore actual system behavior to develop appro- 
priate descriptive information like that in Table 10.2 for the sscanf function, 
which reads input from a string already in memory and interprets that string accord- 
ing to a format specification. 


10.2 Explain what sorts of modifications would be required if SCANTERM were to use 
Unix system calls instead of C support functions for its I/O. Search in the reference 
materials accessible to you for relevant technical information. 


10.3 How does SCANTERM behave if it encounters more than one space between
words? Propose an appropriate revision to the program.


10.4 Would SCANTERM be appreciably simpler if it performed its input using a control
string of “%s” with the scanf function for its input method? Consider revising the
program accordingly.


10.5 What would be wrong with putting a test for whether register R17 holds a null at
why in SORTSTR, and branching to adjust if so?


10.6 The loop beginning at look in SORTSTR uses 9 instruction cycles per byte position
at which two strings are compared, or 72 instruction cycles for every 8 bytes
(equivalent to an aligned quadword). Since the elements in ARRAY are in fact
quadword-aligned, consider replacing the four instructions at look with four other
instructions which would step along the strings by aligned quadwords rather than by
bytes. Those would require only 4 instruction cycles for every 8 bytes processed.
Then rewrite the rest of the instruction sequence in the look loop using an inner
byte loop with a control counter that runs up from 0 to 7 and is thus directly
appropriate for the middle argument of the extbl instruction. Explain whether your
rewritten segment, admittedly involving more instructions on the page and in memory,
would nevertheless result in an overall reduction in instruction cycles for
comparing each 8-byte span of the two strings. How many additional registers are
needed?


10.7 Assume that two particular strings being interchanged by the loop beginning at
swap in SORTSTR each have 79 printing characters, followed by a null. How many
instruction cycles does the loop beginning with swap as shown in Figure 10.4 use?
About how many instruction cycles would a loop that moved bytes instead of
quadwords use?


10.8 Explain in general terms how SORTSTR would have to be modified to bring two,
Two, and TWO into adjacency in the sorted output.


10.9 What elements in the typescript file Lincoln.txt does the SCANFILE program
inaccurately consider to be words? Discuss how a program could better conform to
the standard views of what is, or is not, a word. (This may be harder than it first
seems to be.)


10.10 Sometimes the last few passes in a bubble sort are futile because all the data by then
are already in the desired order. A flag variable can be used to determine whether the
traversals of the inner loop resulted in any exchanges at all (see Knuth). If no
exchanges occurred, the outer loop can safely terminate too. Make this improvement
in either SORTSTR or SORTINT. You can use printf to print the value of the
control variable for the outer loop if you wish to demonstrate that this improvement
is working.


10.11 Find out what actually happens when SORTINT encounters a decimal point in one
of the numbers in its input file (such as 2.71).


10.12 Explain what sorts of changes would be required in order to convert SORTINT into
a program which could handle real numbers (having decimal points and exponential 
notation) instead of integers. If your instructor so requires, proceed to implement 
and test those changes. 


10.13 Redesign the routine for computing the function N 3 4 N based on the finite differ- 
ences algorithm, following these revised specifications: 


a. Input N, the number of values to be computed, as a string of decimal digits entered
by the user.


b. Print each value of the function as soon as it is generated, formatting it as a string
of decimal digits on a line by itself.


c. Print a blank line before going back to input N again. (Quit the program with
control-C.)


Test the program using first N=8 and then N=6 (in a single run). 


10.14 Redesign and extend the routine that tests for palindromes so that it will deal with 
palindromic sentences. The input is to come from the user in mixed upper/lower 
case, with appropriate punctuation for reading in the forward direction. If the sen- 
tence is a palindrome (ignoring spaces, punctuation, and case), then print “Yes” else 
print “No”. Reprompt for another line of input. (Quit the program with control-C.) 


10.15 Write an Alpha assembly language program that will merge two input text files, 
each containing already-sorted lists of (a) signed integer quantities or (b) signed real 
numbers, into an output text file. Be sure that the program correctly handles the end 
of file cases that arise when the second file is shorter than, the same length as, or 
longer than, the first file. 


10.16 Write a program that prompts for the name of a text file and reads it sequentially. 
This program is to calculate the frequency distribution of different word lengths 
(discarding all punctuation characters). Assume that the word lengths will range 
from one letter to twenty letters. Print out into a new text file the name of the input 
file, a distribution table of word lengths, the average word length (use integer divi- 
sion if you omitted Chapter 8), and the most frequent word length. Run the program 
with the Lincoln.txt file and again with some essay which you have written.
(This exercise illustrates the somewhat crude methods of textual analysis that were 
first employed by scholars in the humanities using 1960-era mainframe computers; 
newer, more sophisticated methods have been applied to resolving contested issues 
of authorship.) 


CHAPTER 11 


Performance Considerations


A striving for enhanced performance continually 
drives both the software and hardware aspects of computer science and engineering (see 
Patterson and Hennessy). Perhaps we have a given computer system with some perfor- 
mance limitations. Either we can wait for results to emerge slowly using the existing 
software, or we can try to optimize that software or develop different software based on a 
more efficient algorithm. Perhaps we have a given software application with some per- 
formance limitations. Either we can wait for results to emerge slowly using the existing 
hardware, or we can try to optimize that hardware affordably, or seek a different system 
based on faster physical components or more efficient engineering designs. Here, “we” 
can be taken to mean particular individuals, corporate entities, or society itself. 


In this chapter, we first look at some aspects of program optimization using vari- 
ous implementations of a recursive algorithm in Alpha assembly language. We review 
“counting” the number of instructions, which we have done in previous chapters, and 
we show how to use a timing function from the C support library for obtaining a quanti- 
tative measure that could be compared across different system configurations. We will 
not get into the rather contentious area of defining benchmarks, however. It is ultimately 
the real application work, say weather prediction, not some arbitrary benchmark, which 
has consequences beyond the financial fate of the software or hardware purveyors 
themselves. That is, the best benchmark of all is one’s own actual application work. 


Later in this chapter, we look at some aspects of hardware optimization, though in 
much less detail than book-length treatments (Patterson and Hennessy, or Stallings). We 
will interpret for our readers just a few architectural aspects of the Alpha, selected from 
the previous writings of its designers (Alpha Architecture Committee, and Bhandarkar),
which we feel belong with the programmer’s vantage point taken throughout this book. 


Program Optimization Factors 


If we were writing this book some decades ago, we would be drawing your attention to 
such factors as instruction size, instruction complexity or power, instruction timing, and 
addressing modes. Those factors influence overall efficiency, especially in machines 
with CISC architectures where the highly diversified instruction sets offer many ways 
to perform given tasks. 

The influences of such factors in machines with RISC architectures are somewhat 
more subtle. We now consider certain illustrative software aspects and mention whether 
each factor still matters very much on a RISC-architecture machine like the Alpha. 


Instruction Size 


The VAX instruction set contains two unconditional branch instructions using an offset 
relative to the program counter (PC), as well as a jump instruction that can express a 
generalized target address. The BRB or BRW opcode (one byte) is followed by a one-byte
or a two-byte signed offset, respectively, to be sign-extended before being added to
the already-incremented PC. This gives a branch addressing range of -128 to +127 bytes
or -32768 to +32767 bytes, respectively.

The instruction cycle on a VAX (see Figure 2.4) fetches and interprets a byte 
stream. When BRB will suffice, as far as addressing range is concerned, it is more effi- 
cient than BRW because the latter will require one additional byte for addressing to be 
moved from memory into the CPU. That is, the instantaneous throughput demands 
across the bus connecting the memory and CPU are in the ratio 2:3 for BRB and BRW 
instructions. Since improvements in memory speed lag behind ever-increasing CPU 
speeds, a marginal difference such as this between BRB and BRW can actually become 
more important for the newer implementations of an architecture than at the outset. 

The VAX architecture has numerous other instances of different net instruction 
sizes, as does the Intel x86 architecture. In fact, the VAX offers over 400 one- and
two-byte instruction opcodes that use zero to six operand specifiers, with total instruc- 
tion lengths that vary from one to over 50 bytes. 

RISC designs, such as the Alpha, typically reduce or eliminate any disparity in 
size among instructions of similar functionality. With aligned fixed-length instructions, 
one instruction can never spill across two pages of virtual memory and cause a page
fault which the memory subsystem and/or operating system would need to resolve
before the split instruction could be fully executed. Moreover, the digital logic required
to interpret opcodes and function codes of identical size and bit-positioning can be flat- 
tened, resulting in comparatively fewer gate delays. 


Addressing Mode 


CISC architectures are characterized (or sometimes even caricatured) by their offering 
numerous addressing modes, some of which we outlined in Chapter 4. While these 
modes may differ in the number of instruction bytes needed to express them, the num- 
ber of time-sequential memory references affects performance more significantly than 
instruction size in this instance. Register direct addressing is fastest, register indirect 
addressing is slower, and any further stages of indirection are slower still. 


The Alpha, following RISC principles, offers only register direct and displace- 
ment addressing modes. Register indirect addressing results from a zero displacement. 
The same performance impact applies as for CISC architectures. A program should 
keep its most frequently needed quantities, whether data or addresses, in processor reg- 
isters instead of information units in memory to the maximal extent possible. The per- 
formance difference in favor of the register is only lessened somewhat, not eliminated, 
if a cache location holds a copy of the value from the information unit in memory. For 
example, even a 90% cache hit ratio with a cache 5-fold faster than main memory
yields an effective memory access time of about 0.3 t_main, since
0.9 × 0.2 + 0.1 × 1.0 = 0.28 (where t_main is the access time of main memory).
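
The arithmetic behind that estimate can be sketched in C. This is a deliberately simple model, and the function name is hypothetical: we assume that a cache hit costs the main memory access time divided by the speed ratio, and that a miss costs exactly one main memory access with no added penalty.

```c
#include <assert.h>

/* Effective access time expressed as a fraction of the main memory
   access time.
   hit_ratio: fraction of references satisfied by the cache (0 to 1)
   speedup:   how many times faster the cache is than main memory */
double effective_access_fraction(double hit_ratio, double speedup)
{
    assert(hit_ratio >= 0.0 && hit_ratio <= 1.0 && speedup > 0.0);
    /* Hits cost 1/speedup of a memory access; misses cost a full one. */
    return hit_ratio / speedup + (1.0 - hit_ratio);
}
```

With a hit ratio of 0.90 and a 5-fold faster cache, the function returns 0.28, so even a good cache leaves a memory reference far more costly than a register reference.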


Instruction Power 


The word powerful occurs frequently in marketing materials about computer systems. 
CISC architectures were often thought to derive power both from their large number of 
distinct instruction types and from the amount of data manipulation which a single 
instruction could perform. For example, the three operands of the MOVC3 instruction 
in the VAX architecture specify the source address of string data, the byte length, and a 
destination address to which the string will be copied. Copying N bytes using simpler 
VAX instructions would require a tight loop containing, at a minimum, some means of 
loop control involving a conditional branch instruction as well as the data copying per 
se. However few, those instructions would have to be processed through the entire 
instruction cycle (Figure 2.4) a total of N times. With MOVC3, there is only one 
instruction cycle, albeit a very extended one. Remember, every instruction fetch 
involves at least one memory access, which can be speeded up though not eliminated if 
there is an instruction cache. 

RISC architectures have a reduced number of instructions, but the instructions that
remain are not necessarily constrained to have reduced capabilities. The intricacies of 
architectural design for a processor will inevitably endow some few instructions with 
multiple and often very interesting applications. Examples in the Alpha architecture are 
the cmpbge instruction, which can make eight 8-bit comparisons in parallel, and the 
cmov and fcmov instructions, which can eliminate the need for a branch instruction in 
certain circumstances. Seeking out and exploiting such versatile instructions can have 
remarkable overall leverage when an application or a compiler is being “ported” for a 
new architecture. 
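
To suggest what cmpbge accomplishes, here is a portable model of it in C. This is only a sketch under our own naming (cmpbge_model and null_byte_mask are hypothetical): the loop imitates, one byte at a time, the eight unsigned byte comparisons that the Alpha performs with a single instruction. Comparing zero against a quadword is one well-known application, because it flags every null byte and so can locate the end of a string eight bytes at a time.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of cmpbge: bit i of the result is set when byte i of a
   is greater than or equal to byte i of b, both taken as unsigned. */
unsigned cmpbge_model(uint64_t a, uint64_t b)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++) {
        unsigned byte_a = (unsigned)((a >> (8 * i)) & 0xFF);
        unsigned byte_b = (unsigned)((b >> (8 * i)) & 0xFF);
        if (byte_a >= byte_b)
            mask |= 1u << i;
    }
    return mask;
}

/* Since 0 >= byte holds only when the byte is 0, comparing zero
   against a quadword marks the position of each null byte in it. */
unsigned null_byte_mask(uint64_t quadword)
{
    unsigned mask = cmpbge_model(0, quadword);
    assert(mask <= 0xFFu);          /* only eight result bits exist */
    return mask;
}
```

Byte 0 is the least significant byte of the quadword, so on a little-endian machine the lowest set bit of the mask corresponds to the first null byte in memory order.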


Program Size 


The evolution of microelectronics has long since diminished concerns about mere program
size, because the cost of memory storage has decreased faster than the size
of operating systems and application programs has increased. Much of the bulk
consists of graphical elements, data structures, handlers for error contingencies or rare 
cases, and the like. 

Computer scientists still rigorously analyze the design of algorithms in order to 
determine the irreducible number of unit operations required in relation to the size or 
dimensionality of a problem. If a problem is dimensionally 10 by 10, it may in fact be 
preferable to replicate one instruction sequence 10 times even though the total program 
size increases. When a compiler repeats what would have been the unique body of a 
loop, the resulting optimization is called unrolling the loop. Total execution time may 
decrease because the overhead of initializing, updating, and testing some loop control 
variable has been eliminated, as well as a backward-pointing branch instruction. Unroll- 
ing is not guaranteed to yield improvement, however, because the lengthier unrolled 
instruction sequence may not benefit as much from cache retention as could the shorter 
sequence in the body of a loop. 

Loop unrolling is generally beneficial on RISC systems because the cost of a 
branch instruction is relatively high. The C compiler for the Alpha offers loop unrolling 
as one of several possible types of optimization, and Chapter 12 contains an example. 
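
A hand-written C sketch shows what the transformation amounts to. The function names are hypothetical, and for brevity the unrolled variant assumes that the element count is a multiple of four; a compiler would add code to handle any leftover elements.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one test and one backward branch per element. */
long sum_rolled(const long *data, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += data[i];
    return total;
}

/* Unrolled by four: the loop overhead is paid once per four elements.
   This sketch assumes that count is a multiple of 4. */
long sum_unrolled4(const long *data, size_t count)
{
    assert(count % 4 == 0);
    long total = 0;
    for (size_t i = 0; i < count; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    return total;
}
```

Both functions return the same sum. Whether the second actually runs faster depends on the cache behavior discussed above and should be measured rather than assumed.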


Use of In-Line Functions 


Some compilers have the capability of not only unrolling certain loops, but also insert- 
ing the body of certain routines in-line where otherwise a function call would be posi- 
tioned. We saw in Chapter 7 that procedure calls involve several sources of overhead: 
the jsr/ret instructions, the need to save/restore registers, and the need to handle 
arguments in very precise ways. In-line expansion of functions can reduce such over- 
head, with the chief cost being larger program size. 
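
In C, a programmer can suggest such expansion with the inline keyword, as in the following minimal sketch. The names square and sum_of_squares are hypothetical, and inline is only a request: a compiler remains free to emit an ordinary call instead.

```c
#include <assert.h>

/* A small function that is a natural candidate for in-line expansion. */
static inline long square(long x)
{
    return x * x;
}

long sum_of_squares(const long *data, int count)
{
    assert(count >= 0);
    long total = 0;
    for (int i = 0; i < count; i++)
        total += square(data[i]);   /* may be expanded in place, no call */
    return total;
}
```

If the compiler honors the request, the multiplication appears directly inside the loop body and no call, return, or register saving is needed for square.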


Recursion and Other Factors


Later in this chapter, after we have given a highly simplified description of hardware 
instruction pipelining and the impediments to peak throughput in a pipeline, we will 
examine another software-based optimization, namely, instruction reordering. The 
assemblers for the Alpha offer this option. 

We turn more immediately to recursion, a very standard topic in computer sci- 
ence. Recursion requires some of the overhead associated with procedure and function 
calls. A non-recursive algorithm, if one can be devised to solve a problem equivalently, 
may run appreciably faster. The non-recursive program implementation may or may not 
seem inferior in terms of design, adaptability, and maintainability considerations. For 
transparency, we choose two simple examples of applications from finite number the- 
ory: the Fibonacci numbers (in the next section) and the factorial function (in some of 
the exercises at the end of this chapter). 


Fibonacci Numbers 


Mathematical modeling is not a recent development, as Knuth makes plain in retelling 
the inquiry, first written by Fibonacci in the year 1202, “How many pairs of rabbits can 
be produced from a single pair in a year’s time?” If the rabbits never die, then the popu- 
lation may grow to levels in each reproductive generation that are in proportion to the 
integer sequence 

    1, 1, 2, 3, 5, 8, 13, 21, 34, ...

known as the Fibonacci numbers. Centuries later the release of European rabbits in 
Australia, in the absence of effective predation, indeed led to a runaway population. 
Modern population biologists of course consider in their mathematical models such 
additional factors as food supply, predation, and death. 

How can the sequence of Fibonacci numbers be characterized? First, the correspondence
between the index position n and each Fibonacci number F_n in the sequence
goes as follows:

    n      1    2    3    4    5    6    7    8    9

    F_n    1    1    2    3    5    8   13   21   34


Second, if we notice that a Fibonacci number is equal to the sum of the two preceding 
Fibonacci numbers, we can propose a formal definition 


    F_1 = 1  and  F_2 = 1  and  F_n = F_{n-1} + F_{n-2}  for n > 2

This definition contains a recurrence relationship, by means of which a member of the
sequence can be computed from knowledge of previous members in the sequence. 

An algorithm based on such a recurrence relationship can usually be written most
obviously and readily using recursion: 


if n = 1 or 2
    then F(n) := 1
    else F(n) := F(n-1) + F(n-2)
end if


This pseudocode directly implements the recurrence relationship for the Fibonacci 
numbers. 
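
For readers who prefer to see the recurrence in a high-level language before meeting it in assembly language, the pseudocode can be transcribed into C as follows. This is only an illustrative sketch, and fib_recursive is our own hypothetical name.

```c
#include <assert.h>

/* Direct transcription of the recurrence, with F(1) = F(2) = 1. */
unsigned long fib_recursive(unsigned long n)
{
    assert(n >= 1);                 /* the sequence is indexed from 1 */
    if (n <= 2)
        return 1;
    return fib_recursive(n - 1) + fib_recursive(n - 2);
}
```

Calling fib_recursive(9) yields 34, in agreement with the table of values above.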


Local Variables in Recursion 


Recursion requires the preservation of essential context at every level of calling. Physi- 
cally distinct storage must be provided anew at each level, most practically on a stack, 
to hold many sorts of items associated with procedure calling as well as to allocate
space for local variables. 

If we return briefly to the calling conventions and prescribed models of stack 
usage introduced in Chapter 7, we can diagram the arrangement of information on the 
stack when a routine calls itself many times as in Figure 11.1. We showed in Chapter 7 
how a frame pointer (FP) could be useful, in addition to the main stack pointer (SP), 
particularly for addressing local variables that have storage defined in a stack region. 


[Figure: stack diagram showing three call frames, with larger addresses at the top and
smaller addresses at the bottom. From top to bottom the frames hold the variables of the
caller's caller, the variables of the caller, and the current level's variables. Each frame
contains arguments (beyond first six), local variables, and saved registers and return
address. The current frame extends from FRAME(SP) or 0(FP) at its upper end down to
0(SP) etc. at its lower end.]


Figure 11.1 Local variables and recursive call frames


In Figure 11.1, memory addresses increase toward the top of the diagram. The 
stack is built in the direction of decreasing addresses. Three call frames are shown. Note 
that if more than six arguments are passed, these are placed on the stack by a calling 
level in such a way that they are also accessible to a called level. In a sense, therefore,
each cluster of arguments lies within the scope of both members (the caller and the 
called) of a specific pair of instances of the recursive routine. Both members also share 
the first six arguments passed in registers R16-R21 and/or F16-F21. 

Any variables whose values may change at any level of recursion must be copied 
at any level where their current values must be remembered. Why? The general struc- 
ture of a recursive routine is 


Entry point 

Early portion # Using original parameters from caller 
Recursive call point 

Later portion # May still need original parameters 
Exit point 


Each recursive invocation of the routine has similar needs. Anything that may still be 
needed below the recursive call point must be preserved as a local variable on a stack or 
stored in a register that would be guaranteed to be saved and restored at the next level of 
recursion. 


Three Fibonacci Implementations 


We now proceed to present and analyze three different implementations of a function 
FIBx(n) that will report back in register R0 the value of the nth number in the sequence
of Fibonacci numbers. We will link the three variant functions (FIB1, FIB2, FIB3) to a 
C main program (TESTFIB) for demonstration and testing. 


Full Procedure Call 


FIB1, the first variant of a function for computing the nth member of the sequence of 
Fibonacci numbers, derives directly from the defining recurrence relation (see Figure 
11.2). We elect to have the argument n passed by value, i.e., as an unsigned quadword 
value in register R16. Because each level of recursion calls the function FIB1 twice in
succession, one local variable LOC1 is used to retain the value of n-1 while F(n-1) is
being computed and a second local variable LOC2 is used to retain the value of F(n-1)
while F(n-2) is being computed.


/* FIB1    Full call for Fibonacci numbers (Unix) */
LOC1    = -1*8                          # 1st local: N-1
LOC2    = -2*8                          # 2nd local: F(N-1)
VARS    = 2                             # Number of local variables
REGS    = 2                             # Need to save 2 (incl R26)
STACK   = REGS + VARS                   # Quadwords needed
FRAME   = ((STACK*8+8)/16)*16
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  fib1                    # These three lines
        .ent    fib1                    #  mark the mandatory
fib1:                                   #  function entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        stq     $fp,8($sp)              # Save FP of the caller
        .mask   0x04008000,-FRAME       # Saved registers R26,R15
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        lda     $fp,FRAME($sp)          # Define a frame pointer
        cmple   $16,2,$0                # Set F(N)=1
        blbs    $0,done                 #  if N=1 or 2
        subq    $16,1,$16               # Compute N-1
        stq     $16,LOC1($fp)           #  and remember it
        jsr     $26,fib1                # Compute F(N-1)
        ldgp    $gp,0($26)              # Restore global pointer
        stq     $0,LOC2($fp)            # Remember F(N-1)
        ldq     $16,LOC1($fp)           # Get N-1 again
        subq    $16,1,$16               # R16 = N-2 argument
        jsr     $26,fib1                # Compute F(N-2)
        ldgp    $gp,0($26)              # Restore global pointer
        ldq     $16,LOC2($fp)           # Get F(N-1) again
        addq    $16,$0,$0               #  and add to F(N-2)
done:                                   # R0 = F(N) now
        ldq     $fp,8($sp)              # Restore FP of the caller
        ldq     $26,0($sp)              # Restore exit address
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to the caller
        .end    fib1                    # Mark end of function


Figure 11.2 FIB1: Computing Fibonacci numbers using full procedure calls 


If we call FIB1 from a C program for n=4, a value beyond 2, this first invocation
will set up two successive self-calls to compute F(4-1)=F(3) and F(4-2)=F(2)=1.
Of these two calls, the latter can directly return a value but the former must set up two
more self-calls, to compute F(3-1)=F(2)=1 and F(3-2)=F(1)=1. All told, the
entire course of recursion includes those four internal self-calls in addition to the initiating
outside call from the C program.
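
The tally of calls can be checked with a small instrumented C sketch. The counter and both function names are hypothetical instrumentation of our own, not part of FIB1. Every entry into the recursive function increments a counter, so the result includes the initiating outside call.

```c
#include <assert.h>

static unsigned long invocations;       /* entries into fib_counted */

/* Same recurrence as before, but counting each entry. */
static unsigned long fib_counted(unsigned long n)
{
    invocations++;
    if (n <= 2)
        return 1;
    return fib_counted(n - 1) + fib_counted(n - 2);
}

/* Total number of invocations needed to compute F(n),
   including the initial outside call. */
unsigned long count_fib_calls(unsigned long n)
{
    assert(n >= 1);
    invocations = 0;
    (void)fib_counted(n);
    return invocations;
}
```

For n=4 the count is 5, that is, the outside call plus four self-calls. In general the count equals 2F(n) - 1, so the call overhead grows as fast as the Fibonacci numbers themselves.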


Neither the Unix nor the OpenVMS variant of FIB1 explicitly contributes any- 
thing to a data section. It is axiomatic that the data section of a recursive routine should 
usually contain just read-only data, like the reciprocal of a divisor for implementation 
of integer division using multiplication on the Alpha. Any label and address in the data 
section will refer to only one information unit of storage seen and shared by every 
recursive invocation of the routine. Local variables on the stack, in contrast, have
different addresses because the positions of the SP and FP registers are different for each
recursive invocation. Linkage address information does occur in the data section (Unix
variant) or the linkage section (OpenVMS variant), such as that required in FIB1 to
build the jsr instruction referring to FIB1 itself.


In the Unix version of FIB1, each separate self-call involves certain aspects of 
overhead: 


• executing the jsr instruction;
• allocating stack space for the local variables LOC1 and LOC2;
• saving the global pointer at the outset;
• resetting the global pointer after each self-call; and
• executing the ret instruction.


Although some of these steps may seem unnecessary for this particular function, which 
is very simple, the vendor-defined calling convention requires strict adherence to this 
scheme at the source-code level because every step is essential in more complicated sit- 
uations. 

In the OpenVMS version of FIB1, similar overhead is involved in establishing a
valid base register which is used instead of a global pointer to assist in addressing
information in the linkage section. Moreover, the $call macro has to produce an
instruction to put the appropriate argument information into register R25.

Even though one ought not to circumvent any of this call overhead, you may well 
wonder whether we could reduce its adverse impact on performance in any legitimate 
way(s). Please read onward. 


Internal Local Subroutine Call


FIB2, the second variant of a function for computing the nth member of the sequence of 
Fibonacci numbers, also derives directly from the defining recurrence relation (see Fig- 
ure 11.3). Unlike FIB1, however, FIB2 contains a streamlined implementation of the 
recurrence relation that can be called as a local subroutine (fib) using a bsr instruction
instead of a full function call using a jsr instruction.


/* FIB2    Internal call for Fibonacci numbers (Unix) */
DEPTH   = 100                           # Biggest N
VARS    = 2*DEPTH                       # Number of local variables
REGS    = 1                             # Need to save 1 (incl R26)
STACK   = REGS + VARS                   # Quadwords needed
FRAME   = ((STACK*8+8)/16)*16
        .text                           # Section for program code
        .align  4                       # Octaword alignment
        .set    noreorder               # Disallow rearrangements
        .globl  fib2                    # These three lines
        .ent    fib2                    #  mark the mandatory
fib2:                                   #  function entry
        ldgp    $gp,0($27)              # Load the global pointer
        lda     $sp,-FRAME($sp)         # Allocate stack space
        stq     $26,0($sp)              # Save our own exit address
        .mask   0x04008000,-FRAME       # Saved registers R26,R15
        .frame  $sp,FRAME,$26,0         # Describe the stack frame
        .prologue 1                     # Say that $gp is in use
        lda     $2,FRAME($sp)           # Define local stack pointer
        bsr     $26,fib                 # Call internal routine
        ldq     $26,0($sp)              # Restore exit address
        lda     $sp,FRAME($sp)          # Restore stack level
        ret     $31,($26),1             # Back to the caller

fib:    cmple   $16,2,$0                # Set F(N)=1
        blbs    $0,easy                 #  if N=1 or 2
        lda     $2,-16($2)              # Use 2 quadwords each time
        subq    $16,1,$16               # Compute N-1
        stq     $16,8($2)               #  and remember it
        stq     $26,0($2)               # Save return address
        bsr     $26,fib                 # Compute F(N-1)
        ldq     $16,8($2)               # Get N-1 again
        stq     $0,8($2)                # Store F(N-1)
        subq    $16,1,$16               # Compute N-2
        bsr     $26,fib                 # Compute F(N-2)
        ldq     $16,8($2)               # Get F(N-1) again
        addq    $16,$0,$0               #  and add to F(N-2)
        ldq     $26,0($2)               # Restore return address
        lda     $2,16($2)               # Release local stack space
easy:   ret     $31,($26),1             # Back to outer routine
        .end    fib2                    # Mark end of function


Figure 11.3 FIB2: Computing Fibonacci numbers using an internal recursive routine 


We elect to have the argument n passed by value, i.e., as an unsigned quadword 
value in register R16. Because each level of recursion calls the subroutine fib twice in
succession, a local variable is needed to retain the value of n-1 while F(n-1) is being
computed and then to retain the value of F(n-1) while F(n-2) is being computed.
Another local variable is needed to store the respective return address (from register 
R26) for each entry into fib. 

Those local variables are implemented in FIB2 with a “local stack” using a large 
allowance claimed before the inner fib routine is entered for the first time. Register R2 
is chosen as a local stack pointer. This method complies fully with the calling standard 
in the Unix programming environment, where a routine (here, FIB2) is supposed to 
alter the main stack pointer only with single instructions at entry and exit. Although the 
calling standard in the OpenVMS programming environment is less restrictive, we use 
the same local stack method in the OpenVMS variant of FIB2 (on the CD-ROM that 
accompanies this book). 

You should notice that the inner recursion in fib includes fewer instructions than 
the entire FIB1 function. Since a bsr instruction uses addressing relative to the program
counter, fib does not need to use the generalized addressing provided through the
global pointer (Unix) or other base register (OpenVMS) required to support a jsr 
instruction as in FIB1. In turn, then, that additional register does not need to be saved 
and restored. 

The vendor-sponsored calling standards apply to procedure calling that involves 
external entry points like that of FIB2 rather than internal entry points like fib in
FIB2. Stated another way, the internal details of the FIB2 function may achieve significant
economies as long as the outer shell conforms to the calling standards.


Non-Recursive Algorithm 


FIB3, the third variant of a function for computing the nth member of the sequence of 
Fibonacci numbers, avoids recursion altogether (see Figure 11.4). We elect to have the 
argument n passed by value, i.e., as an unsigned quadword value in register R16. No 
local variables are required for storing return addresses or intermediate values. 

/* FIB3  Fibonacci numbers without recursion (Unix) */

        .text                   # Section for program code
        .align  4               # Octaword alignment
        .set    noreorder       # Disallow rearrangements
        .globl  fib3            # These three lines
        .ent    fib3            #   mark the mandatory
fib3:                           #   function entry
        ldgp    $gp,0($27)      # Load the global pointer
        .frame  $sp,0,$26,0     # Describe the stack frame
        .prologue 1             # Say that $gp is in use
        mov     1,$0            # Assume 1
        subq    $16,2,$16       # R16 = number of additions
        ble     $16,done        # Done if N was 1 or 2
        mov     $0,$1           # R1 = F(n-2), R0 = F(n-1)
loop:   mov     $0,$22          # Save this F(n-1)
        addq    $0,$1,$0        # R0 = newest F(n) now
        mov     $22,$1          # Make this F(n-2) now
        subq    $16,1,$16       # Count down...
        bgt     $16,loop        # ...until finished
done:   ret     $31,($26),1     # R0 = F(N) that was wanted
        .end    fib3            # Mark end of function


Figure 11.4 FIB3: Computing Fibonacci numbers without recursion 


Like FIB1 and FIB2, the FIB3 function handles the first two Fibonacci numbers 
specially with an immediately returned value of 1. In general, each Fibonacci number is 
computed by direct addition of adjacent pairs of values, starting from the first two. 


F3 = F1 + F2
F4 = F2 + F3


Each number in the sequence must be retained long enough to be used twice in such 
additions. That requirement is met using the instruction which hands over F(n-1) in
register R22 from one cycle to become F(n-2) in register R1 for the next cycle, as
though a juggler were to pass the ball just caught from one hand to the other before
tossing it up again. The loop in FIB3 that performs additions must run n-2 times to
compute F(n).

In contrast to FIB1 and FIB2, the FIB3 function does not compute early members 
of the sequence over and over. This improved efficiency dramatically reduces execution 
time, as we illustrate in the next section. 
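
For comparison, the same hand-over scheme can be written as a short C sketch. The function name fib_iterative and the variable names are our own labels for illustration; the comments indicate which Alpha register each variable stands in for.

```c
#include <stdint.h>

/* Non-recursive Fibonacci: n-2 additions for n > 2, none otherwise. */
uint64_t fib_iterative(uint64_t n)
{
    uint64_t newest = 1;              /* stands in for R0: assume 1 */
    if (n <= 2)
        return newest;                /* first two members are special */

    uint64_t older = newest;          /* stands in for R1 = F(n-2) */
    for (uint64_t count = n - 2; count > 0; count--) {
        uint64_t saved = newest;      /* stands in for R22: save F(n-1) */
        newest = newest + older;      /* newest F(n) */
        older = saved;                /* old F(n-1) becomes F(n-2) */
    }
    return newest;
}
```

Because each member of the sequence is computed exactly once, the work grows only linearly with n.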


TESTFIB: Showing the Cost of Recursion 


In order to test the three functions for computing Fibonacci numbers, we can use a short
C program (Figure 11.5). For dramatic effect, we call the variant functions in the order
from fastest (FIB3) to slowest (FIB1).


/* This C calling program will be linked to the
   functions that generate Fibonacci numbers. */

#include <stdio.h>
#include <time.h>
main ()
{
#ifdef __VMS
#include <ints.h>
    uint64 fib1(), fib2(), fib3(), Fn, n;
#define L "L"
#else
    unsigned long int fib1(), fib2(), fib3(), Fn, n;
#define L "l"
#endif
    clock_t t;

    printf("Which number? ");
    scanf("%" L "d",&n);
    t = clock();
    Fn = fib3(n);
    t = clock() - t;
    printf("FIB3 gives %" L "u, timing %" L "u\n",Fn,t);
    t = clock();
    Fn = fib2(n);
    t = clock() - t;
    printf("FIB2 gives %" L "u, timing %" L "u\n",Fn,t);
    t = clock();
    Fn = fib1(n);
    t = clock() - t;
    printf("FIB1 gives %" L "u, timing %" L "u\n",Fn,t);
    printf("CLOCKS_PER_SEC = %d\n", CLOCKS_PER_SEC);
}


Figure 11.5 TESTFIB: Calling program for FIB1, FIB2, and FIB3 


When this program is run, the value computed by FIB3 is printed immediately, no 
matter what number along the sequence is sought and no matter which implementation 
of the Alpha architecture is running the system. If the entered value of n is large enough 
(say 30+), the values computed by FIB2 and FIB1 are printed only after a detectable 
pause. In order to make these comparisons semi-quantitative, the TESTFIB program
calls the C function clock in much the way one uses a stopwatch to time a race. 


The clock function returns the amount of processor time used by the process 
which calls it. There is an implementation-dependent parameter CLOCKS_PER_SEC 
which can be used to convert the returned value to seconds. In the Unix programming 
environment on the Alpha, the unit of time is one microsecond (CLOCKS_PER_SEC 
=1000000). In the OpenVMS programming environment on the Alpha, the unit of 
time is 10 milliseconds (CLOCKS_PER_SEC=100). 
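
The conversion from clock ticks to seconds can be sketched as follows. The helper names ticks_to_seconds and time_call are hypothetical and are not part of TESTFIB; only clock, clock_t, and CLOCKS_PER_SEC come from the standard C library.

```c
#include <time.h>

/* Convert a tick count to seconds for a stated tick rate. */
double ticks_to_seconds(double ticks, double ticks_per_second)
{
    return ticks / ticks_per_second;
}

/* Time one call of a function, stopwatch fashion; the result is in
 * seconds of processor time on whatever system runs the code. */
double time_call(void (*work)(void))
{
    clock_t start = clock();
    work();
    clock_t elapsed = clock() - start;
    return ticks_to_seconds((double)elapsed, (double)CLOCKS_PER_SEC);
}
```

With the rates quoted above, 4000000 ticks at 1000000 ticks per second and 400 ticks at 100 ticks per second both correspond to 4.0 seconds.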


Some results from running TESTFIB on various Alpha implementations are 
shown in Table 11.1. The loops in the variant functions are so short that all of their 
instructions are retained in cache(s) almost all of the time. Thus the relative perfor- 
mance is dominated by the speed of the instruction cycle in the various processor imple- 
mentations. The results for Unix and OpenVMS environments should not be compared 
directly without first recognizing that the Unix variant of FIB1 is two instructions 
shorter than the OpenVMS variant (27 versus 29 instructions, respectively). 


Table 11.1 Comparative Execution Times for Fibonacci Routines


                                 CPU Seconds* to Compute the 35th Fibonacci Number
Alpha Configuration              FIB1           FIB2      FIB3
125-MHz Unix workstation         4.0            2.1       << 0.1
150-MHz OpenVMS workstation      3.6            1.7       << 0.1
275-MHz OpenVMS server           [illegible]    0.9       << 0.1
500-MHz Unix workstation         0.8            0.4       << 0.1


* Measured when the respective systems were as free from other jobs as possible.


Is it reasonable that the clock function reported zero elapsed time to us for the 
FIB3 function? The slowest processor tested here runs at 125 MHz, or 8 nanoseconds 
per clock cycle. Even though the clock function expresses time in microseconds in 
the Unix environment, the minimum reportable elapsed time is actually 17 milliseconds 
(the reciprocal of the 60-Hz frequency of electricity supplies in North America) for this 
particular Alpha system. 


On the assumption that one instruction could be completed on every clock cycle, 
the minimum elapsed time of 17 ms reportable by the clock function on that system 
would allow for execution of some two million instructions. For n=35, the FIB3 func- 
tion would require fewer than 200 instruction cycles. It would be highly unlikely that 
the clock function itself would require more than a few hundred more instruction 
cycles. We have thus demonstrated how very efficient the non-recursive algorithm
proves to be in this instance. 
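
The arithmetic behind this argument can be checked with a small sketch. Both function names are hypothetical. The instruction estimate is our own reading of Figure 11.4 and assumes that ldgp expands into two machine instructions, with four more instructions before the loop, five per pass through the loop, and one for ret.

```c
#include <stdint.h>

/* Clock cycles that fit into an interval given in milliseconds. */
uint64_t cycles_in_interval(uint64_t clock_hz, uint64_t milliseconds)
{
    return clock_hz / 1000 * milliseconds;
}

/* Rough count of machine instructions executed by FIB3 for n > 2:
 * 6 before the loop, 5 per loop pass (n-2 passes), and 1 for ret. */
uint64_t fib3_instruction_estimate(uint64_t n)
{
    return 6 + 5 * (n - 2) + 1;
}
```

For a 125-MHz clock and a 17-ms interval the first function gives 2,125,000 cycles, while the second gives 172 instructions for n=35, some four orders of magnitude fewer.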

In contrast, we have also shown the cost of recursion very clearly in this extreme
case where the call overhead (including saving and restoring things) represents a large 
proportion of the total routine length. If the body of a recursive routine were much big- 
ger in some other example, the relative effect of the overhead would then seem smaller. 

Before you conclude that recursion is a bad concept, however, please look again at 
the data in Table 11.1. We have two recursive functions, FIB1 and FIB2, where one of 
them can slash the execution time by more than a factor of two. Here we can see why 
software engineering can sometimes have a bigger immediate effect than hardware 
engineering. That is, we may be able to speed up a calculation two-fold today by modi- 
fying the implementation of a given algorithm. If we were already using the fastest 
computer we could obtain or afford, we might have to wait some number of years for an 
equivalent hardware speed-up. 


Assembler Optimizations 


Some assemblers—and, by extension, compilers also—can automate the improvement 
of a program’s efficiency in various ways. An assembler or compiler must never subvert 
or alter a programmer’s intent. Yet since even RISC instruction sets provide alternate 
ways of accomplishing a given task, that variability gives scope for numerous imple- 
mentations which would all lie within the programmer’s constraints. 

We have previously drawn attention to certain substitutions that the Unix assembler
automatically makes. When we use an lda instruction, the assembler may generate
one or two different instructions, such as an ldq instruction using a known
displacement from the global pointer to access some information unit in the data section
where the assembler already has put the address in question. The Unix assembler typically
implements a procedure call by putting the target address into register R27 using
an ldq instruction, not an lda instruction.


jsr $26,address     becomes     ldq $27,disp($gp)
                                jsr $26,($27)


Similarly, the ldgp pseudo-instruction at the assembly-language level expands into
two machine instructions, while certain other pseudo-instructions like mov have
one-instruction implementations.

We conducted some empirical studies of the FIB1 function linked with TESTFIB, 
FIB2, and FIB3 (which were not themselves investigated in any further detail), using 
the slowest Unix and OpenVMS systems mentioned in Table 11.1 in order to magnify
the timing differences. We will use such studies as the basis for developing some con- 
cepts of optimization through software-related strategies. 


Instruction Rewriting 


Recall that we achieved about a two-fold reduction in running time when
we transformed FIB1 into FIB2 manually. What capabilities for performance improve- 
ment are built into the Alpha assemblers? 

Until now, we have been inhibiting the Unix assembler from applying any optimizations
by using the -O0 flag with the cc command. Here are the relative times we
obtained for computing the 35th number in the Fibonacci sequence with FIB1 (Figure
11.2) using various levels of optimization:


cc -g  -O0     4.0 seconds
cc -g  -O1     3.4 seconds
cc -g3 -O3     3.1 seconds


Optimization with the -O1 flag moves FIB1 about one-quarter of the way towards the
performance of FIB2 (2.1 seconds in Table 11.1), while optimization with the -O3 flag
(which also requires the -g3 flag if the debugger is to have full information) moves 
FIB1 another one-quarter farther along in performance. 

Since we wanted to see how the assembler accomplished such significant 
enhancements, we proceeded to compare the three machine language versions. One 
way to do that involves just a few commands to dbx, the Unix debugger: 


stop in fib1
run
0xaaaa/nni


where 0xaaaa is the address displayed by the debugger when the FIB1 function has 
been entered, and where nn is the approximate number of machine instructions in 
FIB1. 

We compare the three versions in Table 11.2, where we use horizontal rulings to 
guide you toward seeing which instructions or sequences of instructions have been 
changed and which have not. We abbreviate displacements for all data references with 
just disp, again in order to draw your attention instead to changes in instruction type 
and addressing mode. The marginal notations like <A1 and <A3 mark regions of differ- 
ence, some of which are discussed below. 

Table 11.2 Unix Assembler Optimizations of FIB1 (no instruction reordering)


-g -O0                  -g -O1                      -g3 -O3

fib1:                   fib1:                       fib1:
ldah  $gp,disp($27)     ldah  $gp,disp($27)         ldah  $gp,disp($27)
lda   $gp,disp($gp)     lda   $gp,disp($gp)         lda   $gp,disp($gp)
                        fib1a:                <A1   fib1a:                <A3
lda   $sp,-32($sp)      lda   $sp,-32($sp)          lda   $sp,-32($sp)
stq   $26,0($sp)        stq   $26,0($sp)            stq   $26,0($sp)
stq   $15,8($sp)        stq   $15,8($sp)            stq   $15,8($sp)
lda   $15,32($sp)       lda   $15,32($sp)           lda   $15,32($sp)
cmple $16,2,$0          cmple $16,2,$0              cmple $16,2,$0
blbs  $0,done           blbs  $0,done               blbs  $0,done
subq  $16,1,$16         subq  $16,1,$16             subq  $16,1,$16
stq   $16,-8($15)       stq   $16,-8($15)           stq   $16,-8($15)
ldq   $27,disp($gp)     bis   $31,$31,$31     <B1   bis   $31,$31,$31     <B3
jsr   $26,($27)         bsr   $26,fib1a       <B1   bsr   $26,fib1a       <B3
ldah  $gp,disp($26)     ldah  $gp,disp($26)         ldq_u $31,0($sp)      <B3
lda   $gp,disp($gp)     lda   $gp,disp($gp)         bis   $31,$31,$31     <B3
stq   $0,-16($15)       stq   $0,-16($15)           stq   $0,-16($15)
ldq   $16,-8($15)       ldq   $16,-8($15)           ldq   $16,-8($15)
subq  $16,1,$16         subq  $16,1,$16             subq  $16,1,$16
ldq   $27,disp($gp)     ldq_u $31,0($sp)      <C1   ldq_u $31,0($sp)      <C3
jsr   $26,($27)         bsr   $26,fib1a       <C1   bis   $31,$31,$31     <C3
ldah  $gp,disp($26)     ldah  $gp,disp($26)         bsr   $26,fib1a       <C3
lda   $gp,disp($gp)     lda   $gp,disp($gp)         bis   $31,$31,$31     <C3
ldq   $16,-16($15)      ldq   $16,-16($15)          ldq   $16,-16($15)
addq  $16,$0,$0         addq  $16,$0,$0             addq  $16,$0,$0
done:                   done:                       done:
ldq   $15,8($sp)        ldq   $15,8($sp)            ldq   $15,8($sp)
ldq   $26,0($sp)        ldq   $26,0($sp)            ldq   $26,0($sp)
lda   $sp,32($sp)       lda   $sp,32($sp)           lda   $sp,32($sp)
ret   $31,($26),1       ret   $31,($26),1           ret   $31,($26),1

For -O1 optimization, the effective loop length has shrunk by two instructions
because the function calls itself at a new re-entry point marked at <A1 in Table 11.2,
where we have put a made-up symbolic name fib1a, rather than at the original entry
point fib1. Moreover, both jsr sequences (see above) have been replaced at <B1 and
<C1 in Table 11.2 by bsr instructions alongside either a bis or an ldq_u instruction.

These interpolated instructions are two of the three vendor-suggested no-op 
instructions for the Alpha: 


nop     becomes     bis   R31,R31,R31
fnop    becomes     cpys  F31,F31,F31
unop    becomes     ldq_u R31,0(Rx)


In most Alpha implementations, the universal no-op (unop) and the others incur very 
little instruction cycle overhead since no actual write-back is possible with the read- 
only registers R31 and F31. In fact, an implementation of the Alpha architecture may be 
able to skip execution of unop altogether, and save one execution cycle. 

The no-op instructions that the assembler used in this particular level of optimiza- 
tion have the effect of keeping the overall module length unchanged. Even so, there is a 
time savings because the original style of recursive call involved reading data from an 
address disp($gp) in the ldq instruction. The substituted bsr instruction at <B1 and
<C1 holds a different sort of addressing displacement within the already-fetched instruc- 
tion itself. The effectiveness of this first sort of optimization thus follows directly as a 
corollary from our previous discussion and analysis of addressing modes (Chapter 4). 

For -O3 optimization, the Unix assembler eliminates more references to information
units in memory and substitutes additional unop and nop instructions at <B3 and
<C3 in Table 11.2 that keep the overall module length unchanged. Those eliminations
of the ldgp pseudo-instruction are possible because the entire modified routine from
fib1a to the ret instruction no longer makes any symbolic references apart from
branch displacements in the blbs and bsr instructions. Again here, such eliminations 
produce significant savings in overall execution time. 

Let us try to be somewhat more quantitative. For large n, the instructions in the 
FIB1 function which occur from the entry point up through the blbs instruction and 
the instructions from done through the ret instruction are executed 2F(n) times, while
the instructions which occur between the blbs instruction and done are executed
almost as many as F(n) times (see Exercise 11.6).
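
These execution counts can be checked empirically with a C sketch that tallies how often each part of a doubly recursive routine runs. The structure fib_profile and the function names are hypothetical, introduced here only for illustration. The tallies come out as exactly 2F(n)-1 entries and F(n)-1 passes through the middle section, which approach the figures quoted above for large n.

```c
#include <stdint.h>

typedef struct {
    uint64_t entries;   /* times the entry and exit sequences run */
    uint64_t bodies;    /* times the code between blbs and done runs */
    uint64_t value;     /* F(n) itself */
} fib_profile;

/* Doubly recursive Fibonacci that counts its own activity. */
static uint64_t counted_fib(uint64_t n, fib_profile *p)
{
    p->entries++;                 /* every call runs the entry and exit code */
    if (n <= 2)
        return 1;                 /* the branch to done is taken */
    p->bodies++;                  /* only non-trivial calls run the middle */
    return counted_fib(n - 1, p) + counted_fib(n - 2, p);
}

fib_profile profile_fib(uint64_t n)
{
    fib_profile p = {0, 0, 0};
    p.value = counted_fib(n, &p);
    return p;
}
```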

For the -O0, -O1, and -O3 levels of optimization of the FIB1 function, the total
numbers of machine instructions to be executed are thus approximately some multiple of

-O0     2*(8+4)+15 = 39
-O1     2*(6+4)+15 = 35
-O3     2*(6+4)+15 = 35


of which the numbers of machine instructions that involve loading/storing data from/to 
memory are approximately the same multiple of 


-O0     2*(2+2)+6 = 14
-O1     2*(2+2)+4 = 12
-O3     2*(2+2)+4 = 12


We might speculate that the total numbers of accesses to memory (instruction fetch plus
data load/store) could partially explain differences in overall execution times. Those
total numbers are in the ratios 53:47:47, but the performance actually improved more
steeply, in the ratios 4.0:3.4:3.1 according to our observations.

Clearly the simplistic methodology of just counting memory references of all
kinds as a performance proxy does not adequately account for the observed
performance differences. Please keep this experiment in mind as we proceed toward a brief
overview of the topic of pipelining later in this chapter. 


Instruction Reordering 


Not only have our programs been processed using the -O0 flag with the cc command,
but they have also contained the .set noreorder directive to the assembler. Here 
are the relative times we obtained for computing the 35th number in the Fibonacci 
sequence with a modified FIB1 function specifying .set reorder instead, again 
using various levels of optimization: 


cc -g  -O0     4.0 seconds
cc -g  -O1     3.4 seconds
cc -g3 -O3     2.8 seconds


As before, we used commands to dbx, the Unix debugger, to look at the actual
sequences of machine instructions (Table 11.3). For the FIB1 function, there were no
differences at the -O0 and -O1 levels of optimization between the original source
program containing .set noreorder and the modification containing .set
reorder. The first two columns in Table 11.3 thus contain the same sequences of
instructions as the first two columns in Table 11.2. In a more general case, some
differences would be expected, such as we observed with the OpenVMS assembler (data
to be discussed below).

Table 11.3 Unix Assembler Optimization of FIB1 (instruction reordering permitted)


-g -O0                  -g -O1                      -g3 -O3

fib1:                   fib1:                       fib1: and fib1a:      <A3
ldah  $gp,disp($27)     ldah  $gp,disp($27)         lda   $sp,-32($sp)    <A3
lda   $gp,disp($gp)     lda   $gp,disp($gp)         ldah  $gp,disp($27)
                        fib1a:                <A1
lda   $sp,-32($sp)      lda   $sp,-32($sp)          lda   $gp,disp($gp)
stq   $26,0($sp)        stq   $26,0($sp)            stq   $26,0($sp)
stq   $15,8($sp)        stq   $15,8($sp)            stq   $15,8($sp)
lda   $15,32($sp)       lda   $15,32($sp)           cmple $16,2,$0        <B3
cmple $16,2,$0          cmple $16,2,$0              lda   $15,32($sp)     <B3
blbs  $0,done           blbs  $0,done               blbs  $0,done
subq  $16,1,$16         subq  $16,1,$16             ldq   $27,disp($gp)   <C3
stq   $16,-8($15)       stq   $16,-8($15)           subq  $16,1,$16
ldq   $27,disp($gp)     bis   $31,$31,$31     <B1   stq   $16,-8($15)
jsr   $26,($27)         bsr   $26,fib1a       <B1   bsr   $26,fib1a
ldah  $gp,disp($26)     ldah  $gp,disp($26)   <B1   bis   $31,$31,$31
lda   $gp,disp($gp)     lda   $gp,disp($gp)   <B1   ldq   $16,-8($15)     <D3
stq   $0,-16($15)       stq   $0,-16($15)           bis   $31,$31,$31
ldq   $16,-8($15)       ldq   $16,-8($15)           stq   $0,-16($15)
subq  $16,1,$16         subq  $16,1,$16             ldq   $27,disp($gp)   <E3
ldq   $27,disp($gp)     ldq_u $31,0($sp)      <C1   subq  $16,1,$16
jsr   $26,($27)         bsr   $26,fib1a       <C1   bsr   $26,fib1a       <F3
ldah  $gp,disp($26)     ldah  $gp,disp($26)         ldq   $16,-16($15)    <F3
lda   $gp,disp($gp)     lda   $gp,disp($gp)         ldah  $gp,disp($26)
ldq   $16,-16($15)      ldq   $16,-16($15)          lda   $gp,disp($gp)
addq  $16,$0,$0         addq  $16,$0,$0             addq  $16,$0,$0
done:                   done:                       done:
ldq   $15,8($sp)        ldq   $15,8($sp)            ldq   $26,0($sp)      <G3
ldq   $26,0($sp)        ldq   $26,0($sp)            ldq   $15,8($sp)      <G3
lda   $sp,32($sp)       lda   $sp,32($sp)           lda   $sp,32($sp)
ret   $31,($26),1       ret   $31,($26),1           ret   $31,($26),1

The horizontal rulings and marginal indicators like <D3 are positioned differently
in Table 11.3 than in Table 11.2 to draw your attention to the considerable amount of
reordering of instructions in the third column for the -O3 level of optimization.

When the source module contained .set reorder, the -O3 optimization not
only rewrote some of the instructions but also reordered the sequence of instructions, 
subject to a strict compliance with the programmer’s intent. It may at first seem surpris- 
ing that the recursive loop has been extended back to 27 instructions, of which 10 
involve loading/storing data from/to memory, yet the performance is better than any 
other arrangement in Tables 11.2 and 11.3. 

Anticipating again our brief treatment of the topic of pipelining below, we want 
you to see a common pattern in certain reorderings in the third column of Table 11.3. At
<A3, <B3, <D3, or <G3, a new value is established in register $sp, $0, $16, or $26
earlier in the sequence than for the other levels of optimization. The number of intervening
full instruction execution cycles is now 2, 1, 3, or 2 in contrast to 0, 0, 0, or 1 at those 
places. Please keep these experimental results in mind as we proceed toward a brief 
overview of the topic of pipelining later in this chapter. 

The OpenVMS assembler for Alpha can be invoked with the /optimize qualifier
to request that instruction reordering be permitted during the assembly process.
When we did this with an OpenVMS version of FIB1, we found similar movements of 
instructions earlier in the sequence. The modification analogous to going from the first 
to third column in Table 11.3 involved two interpolations of 1 and 2 instructions, but 
those improved the overall performance only a few percent. 

We may thus conclude that separating the instruction which establishes a new 
value in a register from subsequent instructions which depend upon that new value has a 
definite impact upon performance, since the reordered recursive loop performs better 
than slightly shorter loops that have not been reordered. 

The Unix cc command has another capability to invoke a post-linking optimizer, 
with a combination of -non_shared and -om flags, which can remove some of the no-op
instructions from an optimized routine such as the variant of FIB1 in the third column
of Table 11.3. We tried that for this particular case, but obtained very little net improve- 
ment in performance even though the recursive loop had been shortened. This observation 
suggests that no-op instructions may have a rather neutral overall impact on performance. 


Instruction Pipelining 


Most computers, whether RISC or CISC, have come to rely upon instruction pipelining 
as an important means of enhancing performance. In any architecture, the various 
stages of the instruction cycle (such as Figure 2.4) may involve much internal specialization
at the hardware level. Some electronic components perform instruction decoding,
while others handle the retention of results, for example. The basic tenets of
instruction pipelining involve specialization of function and an attempt to keep all the 
functional units fully productive at all times. An analogy is often drawn to the industrial 
model of an assembly line in a manufacturing process. 

An instruction pipeline has a depth, generally taken to be the number of stages 
that handle various distinguishable phases of a complete instruction cycle. The earliest 
implementations of the Alpha architecture (21064 chip) employed seven stages in the 
pipelined digital logic for integer instructions: 


Stage 0:   Fetch
Stage 1:   Swap dual-issue instructions; predict branch direction
Stage 2:   Decode
Stage 3:   Read source register(s)
Stage 4:   Compute (arithmetic); form new address (branches)
Stage 5:   Compute (arithmetic); cache lookup
Stage 6:   Write destination register; update cache hit/miss status


The section of the CPU devoted to floating-point instructions had a few more stages in all. 

Ideally, another new instruction can be started at every clock cycle and another 
instruction already in progress can be concluded. With seven stages, the CPU can be 
working on different phases of up to seven instructions at once, keeping all the physically 
distinct functional stages busy all the time. The throughput can thus be up to seven times 
greater in a pipelined CPU than if one instruction had to be completed before the next 
could begin serially in a non-pipelined CPU, all other factors being the same. We can diagram
these ideas as in Figure 11.6, which shows the progression of instructions (I1, I2,
...) through the execution stages with the progression of time extending from left to right.





                         Time Interval

           1   2   3   4   5   6   7   8   9  10  11  12  13  14

Stage 0   I1  I2  I3  I4  I5              I6  I7  I8  I9 I10 I11
Stage 1       I1  I2  I3  I4              I5  I6  I7  I8  I9 I10
Stage 2           I1  I2  I3              I4  I5  I6  I7  I8  I9
Stage 3               I1  I2   *   *   *  I3  I4  I5  I6  I7  I8
Stage 4                   I1              I2  I3  I4  I5  I6  I7
Stage 5                       I1              I2  I3  I4  I5  I6
Stage 6                           I1              I2  I3  I4  I5


* Latency of three cycles assumed to be caused by instruction 2.


Figure 11.6 Seven-stage pipeline 

In this diagram, we show the consequences of assuming that instruction I2 has to
wait three extra cycles before it can be issued, i.e., be allowed to pass through stages 4-6
that execute an instruction and store its results. Since stages 0-3 already hold instructions,
no further instructions can enter the pipeline until the blocked instruction I2
begins to move along again. If we are watching to see the succession of instructions 
being completed, we will observe a “bubble” instead of a steady flow because of pipe- 
line stalling. 

On the Alpha, stages 0 through 3 may be stalled, while stages 4, 5, and 6 always proceed
one after the other in three successive time intervals. In Figure 11.6 the instructions
I5, I4, I3, and I2 are stalled at stages 0, 1, 2, and 3, respectively, for three clock cycles.
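
The timing in the seven-stage diagram can be summarized by a simple idealized model, sketched here in C. The function names are hypothetical, and the model assumes that one instruction enters per cycle and that every stall cycle delays all later instructions equally.

```c
#include <stdint.h>

/* Cycles until the last of `count` instructions leaves a pipeline of
 * the given depth, with `stall_cycles` lost cycles along the way. */
uint64_t pipelined_cycles(uint64_t depth, uint64_t count, uint64_t stall_cycles)
{
    if (count == 0)
        return 0;
    return depth + (count - 1) + stall_cycles;
}

/* Cycles needed if each instruction must finish before the next begins. */
uint64_t serial_cycles(uint64_t depth, uint64_t count)
{
    return depth * count;
}
```

With a depth of 7, five instructions and three stall cycles finish in time interval 14, as in the diagram, whereas the same five instructions would need 35 cycles without pipelining.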


Pipeline Hazards


Instruction pipeline performance can be degraded by three broad groups of causes 
called hazards: resource conflicts, procedural dependencies, and data dependencies. 

Resource conflicts can arise when instructions in different stages of the pipeline 
simultaneously require, for instance, the same access to memory or one particular func- 
tional unit. Those conflicts can be mitigated by designing a CPU with only one stage for 
reading or writing data and with duplicate functional units. The limiting case with N- 
fold replication of functional units is called a superscalar architecture because it can 
perform as many as N independent scalar operations such as integer additions at the 
same time. 

Procedural dependencies result, in general, from variable-length instruction sets 
and from branches (which typically represent 10-20% of program instructions). Some- 
times partially completed instructions in a pipeline have to be abandoned, but certainly 
in such a way that the logical state of the machine has not been altered. Strategies for 
resolving such breakdowns of the pipeline include instruction buffers, dynamic branch 
prediction logic, delayed branches, and static prediction arranged by compilers. 

Data dependencies may include: data flow dependencies, where an instruction 
wants to read data not yet fully computed and written into a register by another 
instruction; antidependency, where an instruction wants to write into a destination 
that another instruction still requires as a source of data; and output dependency, 
where two instructions both want to update the same destination, but only the logi- 
cally later one (as intended by the programmer) can be permitted to do so. Such 
dependencies can be alleviated using combinations of remedies: intelligent compilers 
to minimize the number of hazards; interlocks; forwarding (deriving source informa- 
tion earlier through additional dedicated pathways in the CPU); and register renam- 
ing, where the CPU has a pool of generic registers that can allow two instructions to 
proceed as though they work with two independent storage units each transiently
named R0 locally to the separate instructions.
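
The three kinds of data dependency can be made concrete with a small C sketch that classifies a pair of instructions by their register usage. The type instr, the DEP_ constants, and the function names are hypothetical illustrations. The sketch treats register 31 as never carrying a dependency, since on the Alpha it always reads as zero and writes to it are discarded.

```c
#include <stdbool.h>

enum { DEP_NONE = 0, DEP_FLOW = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

typedef struct {
    int dest;   /* destination register number, or -1 if none */
    int src1;   /* first source register number, or -1 if none */
    int src2;   /* second source register number, or -1 if none */
} instr;

/* Register 31 and "no register" can never link two instructions. */
static bool carries_data(int reg)
{
    return reg >= 0 && reg != 31;
}

static bool reads(instr i, int reg)
{
    return carries_data(reg) && (i.src1 == reg || i.src2 == reg);
}

/* Returns a bit mask of the dependencies of `later` on `earlier`. */
int classify(instr earlier, instr later)
{
    int deps = DEP_NONE;
    if (reads(later, earlier.dest))
        deps |= DEP_FLOW;      /* later reads what earlier writes */
    if (reads(earlier, later.dest))
        deps |= DEP_ANTI;      /* later overwrites a source of earlier */
    if (carries_data(earlier.dest) && earlier.dest == later.dest)
        deps |= DEP_OUTPUT;    /* both write the same destination */
    return deps;
}
```

For example, cmple $16,2,$0 followed by blbs $0,done is classified as a data flow dependency through register 0.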
Here we have just given highlights of a theoretical classification of pipeline hazards. 
We now turn to a more empirical treatment that we can then apply directly to the Alpha. 
Smooth flow of instructions through the pipeline, which is the goal that software 
should always attempt to arrange, can be disrupted in numerous ways, including the 
following: 


• data stalls from the cache and slower memory components;
• branch-induced pipeline flushing;
• producer-consumer dependencies; and
• multiple-issue difficulties.


Such disturbances cause the throughput to falter, in the worst case back to the perfor- 
mance of a non-pipelined CPU paced by the same clock pulses. 


Data Stalls 


Data stalls can occur when writing new data or when reading previously stored data that 
are not currently contained in a cache close to the processor. Several instruction cycles 
can be lost. Frequently used quantities, such as loop control variables and counters, are 
prime candidates to be held in processor registers if enough of them are available. 


The actual effects of data stalls are difficult or impossible to estimate purely 
from a software standpoint, because the number of cache levels, their sizes, their 
speeds, and their organizational and functional natures are implementation details of 
particular hardware systems, rather than known aspects of an architecture. The width 
of bus structures within the CPU and from the CPU to the caches and bulk memory 
are also important. 

Most cache systems work at a level of granularity that may span several adjacent 
information units as defined and viewed by a programmer. Thus it is best to store adja- 
cently any items that will tend to be used in close proximity within any segment of a 
program. The smallest number of adjacent bytes from main memory that can be shad- 
owed in a cache is called the size of a line in that cache, or its block size. 

For the Alpha, it is wise to specify at least quadword alignment for the starting 
addresses of stored data elements or arrays. Also recall, in this context, that the conven- 
tion for the Unix programming environment specifies that stack space be allocated in 
octaword multiples in memory. 
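
The relationship between alignment and cache lines can be illustrated with a few helper functions in C. The function names are hypothetical, and the line size is a parameter because it is an implementation detail; the value 32 used in the remark below is only an assumed example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the address is a multiple of the alignment (8 for quadwords). */
bool is_aligned(uint64_t address, uint64_t alignment)
{
    return address % alignment == 0;
}

/* True if two byte addresses fall within the same cache line. */
bool same_line(uint64_t a, uint64_t b, uint64_t line_size)
{
    return a / line_size == b / line_size;
}

/* True if an item of `size` bytes starting at `address` spans two lines. */
bool straddles_line(uint64_t address, uint64_t size, uint64_t line_size)
{
    return !same_line(address, address + size - 1, line_size);
}
```

Assuming a 32-byte line, a quadword stored at the aligned address 0x1018 lies wholly within one line, whereas a quadword stored at the unaligned address 0x101c would straddle two lines.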


Branch Effects


In an ordinary instruction sequence, successive fetches by way of the instruction cache 
or from main memory occur from steadily increasing addresses as the program counter 
(PC) is incremented (always by 4 address units on an Alpha). 


Unconditional branches, subroutine calls, and subroutine returns determine a new 
value for the PC at pipeline stage 4. When those opcodes are recognized at stage 2, the 
two instructions already at stages 0 and 1 must be ignored, i.e., be flushed from the 
pipeline. The pipeline may pretty much empty out—perhaps for a long time unless the 
branch, call, or return target instruction is already held in the instruction cache. 


Some architectures (and compilers) support some form of prefetching which can 
initiate the copying of a new sequence of instructions from main memory into the 
instruction cache on a speculative basis that does not intrude upon current operations, 
well ahead of the location of an actual jump instruction. 


Conditional branch instructions raise the topic of prediction. For the Alpha archi- 
tecture, the original implementations predict all forward conditional branches as not 
taken and all backward branches as taken. If the programmer (or compiler) can ascer- 
tain which Boolean condition for a set of data to be encountered in an algorithm is the 
more probable, then the program should be physically arranged advantageously. A 
wrong prediction produces a bubble in the pipeline of four cycles, assuming that the 
instruction at the target address is held in cache. 
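The static prediction rule just described can be modeled in a few lines of C. This is a sketch under stated assumptions, not a description of the actual hardware logic: the function names are hypothetical, and the four-cycle bubble is the figure quoted above for a target that is already in cache.

```c
#include <stdint.h>

enum { MISPREDICT_BUBBLE = 4 };   /* cycles lost per wrong prediction */

/* Static rule of the original implementations: a backward conditional
   branch (target at or below the branch) is predicted taken, and a
   forward one is predicted not taken. */
int predicted_taken(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc <= branch_pc;
}

/* Cycles lost to bubbles when a branch executes n times and is
   actually taken n_taken of those times. */
uint64_t bubble_cycles(uint64_t branch_pc, uint64_t target_pc,
                       uint64_t n, uint64_t n_taken)
{
    uint64_t wrong = predicted_taken(branch_pc, target_pc)
                         ? n - n_taken    /* predicted taken, fell through */
                         : n_taken;       /* predicted not taken, was taken */
    return wrong * MISPREDICT_BUBBLE;
}
```

A branch that is taken 9 times out of 10 costs 4 cycles in this model when it is arranged as a backward branch, but 36 cycles when it is arranged as a forward branch, which is why the physical arrangement of the program matters.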


Producer-Consumer Effects 


For the Alpha architecture, no instruction that is a consumer is ever permitted to be 
issued (i.e., to proceed into stages 4 through 6) until every prior producer has estab- 
lished all the source information it needs. For example, when we write 


    cmple  R3, R2, R1       ; if R3 <= R2 
    blbs   R1, smaller      ; then branch to smaller 


the cmple instruction is a producer of information in register R1 and the blbs instruc- 
tion is a consumer of that information. It is a feature of the Alpha architecture in partic- 
ular that the information, here a 0 or 1 in register R1, will also be available for other 
consumers, which may be located either in sequence (branch not taken because register 
R1 contains 0) or at the label smaller (branch taken because register R1 contains 1). 
Although that information is not stored into register R1 until the cmple instruction 
reaches stage 6, the actual determination of the 0 or 1 occurs during an earlier stage. 
Within practical limits, a CPU can contain redundant internal connections that 
permit some measure of helpful spying among the pipeline stages for the sake of reduc- 
ing delays, or latency. In the absence of special spying opportunities, we might expect 
that stage 4 for the blbs instruction would follow immediately after stage 6 for the 
cmple instruction along a time sequence, i.e., that the branch would be detained at 
stage 3 while the compare instruction goes through stages 4, 5, and 6. The latency 
would thus be 2 cycles as contrasted with 0 cycles for a non-consumer instruction fol- 
lowing the compare. Actually the latency can be as little as one cycle in this case 
because the branch instruction can derive its source information through internal CPU 
connections; it does not have to read, literally, from register R1. This is what is meant 
by forwarding. 


Once a CPU is fully designed, its latency rules become a known feature of the par- 
ticular implementation, not of the architecture. Bhandarkar gives such data for the earli- 
est Alpha implementation (21064 chip), from which we have excerpted as Table 11.4 
the specific information needed for an exploration of our FIB1 function. 


Table 11.4 Latency Matrix for Producer-Consumer Model of Alpha Instructions in FIB1 


                                        Instruction Type of the Producer 

                                             jsr, bsr,   addq, subq, 
Instruction Type of the Consumer     ldq     ret         lda, ldah, nop    cmple 

ldq                                   3       3           2                 2 
stq (base register)                   3       3           2                 2 
stq (data register)                   3       3           0                 0 
blbs                                  3       3           1                 1 
jsr, bsr, ret                         3       3           2                 2 
addq, subq, lda, ldah, nop            3       3           1                 2 
cmple                                 3       3           1                 2 





Let us now return to a comparison of the first and third columns of Table 11.3. The 
number of instances of producer-consumer pairs involving any latency drops from 10 
for the -O0 level of optimization to 3 for the -O3 level of optimization. Remembering 
that the early and late sections of FIB1 are ultimately executed twice per execution of 
the middle section, we determined that the relative latency penalties drop from 26 (-O0 
level) to 3 (-O3 level). In effect, then, the relative number of CPU cycles drops from 
39+26=65 to 35+3=38. That decrease is larger than the relative drop in execution time 
from 4.0 to 2.8 seconds that we observed, however. 
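The comparison in the preceding paragraph can be made explicit with a small calculation. The C sketch below uses only the figures quoted above; the function names are ours.

```c
/* Relative cycle count: instructions executed plus latency penalty cycles. */
int relative_cycles(int instructions, int latency_penalty)
{
    return instructions + latency_penalty;
}

/* Ratio of the optimized figure to the unoptimized figure. */
double improvement_ratio(double optimized, double unoptimized)
{
    return optimized / unoptimized;
}
```

With these figures, improvement_ratio(38, 65) is about 0.58, whereas the measured times give improvement_ratio(2.8, 4.0) = 0.70. The cycle-count model therefore predicts a larger gain than the one actually observed.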
Another factor would introduce latencies about equally for any level of optimization 
of the FIB1 function, namely, the effect of data cache misses while the intermediate 
values and return addresses were being stored into never-before accessed information 
units on the stack. The magnitude of that effect depends on the operation and speed of 
the memory subsystem and its interaction with the data cache. 

In the end, because the memory is separate from the CPU, we are left without a 
fully quantitative reconciliation of the performance data that we observed. 


Multiple-Issue Effects 


Up to this point, the execution stages (4 through 6) of an Alpha pipeline have been 
treated as general-purpose, i.e., capable of working on every opcode in the instruction 
set. Actually, considerable differentiation according to function or data type is present 
in the form of separate “boxes” that perform integer operations, floating-point opera- 
tions, loads or stores, and branching. Such physical separateness raises the prospect of 
issuing multiple instructions by working on more than one instruction during stages 4 
through 6, provided that entirely orthogonal physical resources are needed. 

The earliest Alpha implementations (based on the 21064 chip) could follow these 
dual-issue rules: 


• a load/store can execute in parallel with an operate (integer or floating-point); 

• an integer operate and a floating-point operate can execute in parallel; 

• a branch can execute in parallel with a load/store or an operate; 


with exceptions that involve internal conflicts for unique physical resources, such as 
internal bus pathways: 


• a branch cannot execute in parallel with a store of the same format (i.e., using the 
  same register set); and 

• a store or branch cannot execute in parallel with an operate of the opposite format 
  (i.e., using the other register set). 
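The dual-issue rules and exceptions listed above can be expressed as a small predicate. The C sketch below is a simplified model under our own assumptions: the type and function names are hypothetical, every instruction is reduced to a kind and a register-set format, and no other resource conflicts are considered.

```c
enum kind   { K_LOAD, K_STORE, K_OPERATE, K_BRANCH };
enum format { F_INTEGER, F_FLOATING };    /* register set the instruction uses */

struct insn { enum kind kind; enum format format; };

int is_memory(struct insn x)
{
    return x.kind == K_LOAD || x.kind == K_STORE;
}

/* The two exceptions, tested with a as the branch or store. */
int conflicts(struct insn a, struct insn b)
{
    /* A branch cannot pair with a store of the same format. */
    if (a.kind == K_BRANCH && b.kind == K_STORE && a.format == b.format)
        return 1;
    /* A store or branch cannot pair with an operate of the opposite format. */
    if ((a.kind == K_STORE || a.kind == K_BRANCH) &&
        b.kind == K_OPERATE && a.format != b.format)
        return 1;
    return 0;
}

/* Nonzero when a and b may issue together under the rules above. */
int can_dual_issue(struct insn a, struct insn b)
{
    int allowed =
        (is_memory(a) && b.kind == K_OPERATE) ||
        (a.kind == K_OPERATE && is_memory(b)) ||
        (a.kind == K_OPERATE && b.kind == K_OPERATE && a.format != b.format) ||
        (a.kind == K_BRANCH && (is_memory(b) || b.kind == K_OPERATE)) ||
        ((is_memory(a) || a.kind == K_OPERATE) && b.kind == K_BRANCH);
    return allowed && !conflicts(a, b) && !conflicts(b, a);
}
```

In this model an integer load pairs with a floating-point operate, but an integer store does not, because the store and the operate are of opposite formats.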


Quite different rules, constraints, and latencies apply for later Alpha implementations 
(based on the 21164 chip, for instance) because there are additional functional units, 
such as a duplicate integer-processing unit. Up to four instructions can be issued to exe- 
cute in parallel (stages 4 through 6 for integer data, and the analogous stages for float- 
ing-point data). 
One further constraint applies to dual-issue and quad-issue implementations of the 
Alpha architecture. Only naturally aligned groups of two or four instructions are han- 
dled at any one time. If one or more of those produce a pipeline stall, no new group of 
instructions will be considered until that group as a whole starts moving along again. 
Compilers can (and should) insert no-op instructions into instruction slots which can be 
predetermined to be nonproductive anyway. 
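The natural-alignment constraint can be illustrated with simple address arithmetic. The following C sketch assumes 4-byte instructions and an issue width of 2 or 4; the function names are ours and do not correspond to any actual tool.

```c
#include <stdint.h>

/* Index of the naturally aligned issue group that holds the
   instruction at address pc (width = 2 or 4 instructions per group). */
uint64_t issue_group(uint64_t pc, unsigned width)
{
    return pc / (4u * width);
}

/* Nonzero when two instructions can belong to the same issue group. */
int same_issue_group(uint64_t pc_a, uint64_t pc_b, unsigned width)
{
    return issue_group(pc_a, width) == issue_group(pc_b, width);
}

/* Number of no-ops to place at next_pc so that the instruction
   following them starts a fresh aligned group. */
unsigned padding_nops(uint64_t next_pc, unsigned width)
{
    unsigned slot = (unsigned)((next_pc / 4) % width);
    return slot == 0 ? 0 : width - slot;
}
```

Two adjacent instructions at addresses 0x1004 and 0x1008 fall into different groups on a dual-issue implementation, even though they are neighbors in memory, and so can never be issued together.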

Other architectures support multiple issuance in different ways. The RS/6000® 
was introduced by IBM in superscalar implementations; that is, a sufficient degree of 
replication exists in the CPU to support two instances of each pipeline stage. The Pow- 
erPC architecture (IBM and Motorola) attempts to prefetch instructions continuously 
into a holding buffer and to dispatch them as soon as appropriate operational units 
become available, subject to the constraint of never altering the programmer’s intended 
logical outcome. 


Multiplication by a Constant 


Bhandarkar notes that the latency of integer multiply instructions is over 20 cycles for 
the earliest Alpha implementations (21064 chip). The digital logic in the multiplier fin- 
ishes four bits per cycle. 

We observed previously (at the end of Chapter 4) that the Unix assembler pro- 
duces short sequences using other integer operate instructions when a multiply instruc- 
tion contains a fixed unsigned constant up to 255 as its second operand, for example: 


    mulq $1,59,$0    becomes    sll    $1,4,$0       # 16*n 
                                subq   $0,$1,$0      # 16n - n = 15n 
                                s4subq $0,$1,$0      # 4*15n - n = 59n 


Register R0 is both producer and consumer here, but the producer-consumer latencies 
among integer arithmetic and shift instructions are small, on the order of one or two 
cycles per pair in addition to a standard execution time. Thus the substituted 
three-instruction sequence can complete much sooner, overall, than the original mulq 
instruction. The OpenVMS assembler does not appear to perform any analogous substi- 
tutions for integer multiply instructions. 
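The arithmetic behind the substitution for the constant 59 can be checked in C. Each statement in this sketch mirrors one instruction of the shift-and-subtract sequence; the function name is ours.

```c
#include <stdint.h>

/* Multiply n by 59 without a multiply instruction. */
uint64_t times59(uint64_t n)
{
    uint64_t r = n << 4;    /* sll:    16n               */
    r = r - n;              /* subq:   16n - n    = 15n  */
    r = 4 * r - n;          /* s4subq: 4*15n - n  = 59n  */
    return r;
}
```

Because unsigned 64-bit arithmetic wraps modulo 2^64 in the same way at every step, the result agrees with a true 64-bit multiplication by 59 for every value of n.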

When the second operand is an unknown quantity held in a register, the assembler 
(or a compiler) has no opportunity for such substitutions for a multiply instruction. It 
would be difficult to envision writing logic for a case structure or table lookup for the 
small-integer instances within an allowance of 20 cycles, as compared to the multiply 
instruction itself. Note that a substantial number of non-consuming, non-conflicting 
instructions (e.g., floating-point or load/store) can be positioned and executed “for free” 
in the time slots after an integer multiplication instruction. A sophisticated Alpha 
compiler would try to do that through instruction reordering. 


Summary 


We have discussed a broad range of hardware-software performance considerations in 
this chapter, all from a programmer’s perspective. An architecture imposes some perfor- 
mance limitations, but in general the relative weighting of various factors is highly 
implementation-dependent. Software strategies are ideally consonant with hardware 
capabilities. If not, the degradation in performance can be severe. 

Contemporary assemblers (and compilers) contain software logic for optimiza- 
tions whose effects can be substantial. In extraordinary circumstances, hand coding 
could probably attain even larger improvements (like our FIB2 in this chapter), but at a 
high labor cost and at some risk as regards future maintainability. 
EXERCISES 


11.1 Explain in your own words how the introduction of RISC processors has overturned, 
or at least modified, the tenets of machine-level programming for best performance. 


11.2 The effective memory access time (the average time it takes to reference a 
variable held in memory) for a computer using a single cache is given by 
t_eff = t_cache + (1 - h) * t_main, where t_eff is the effective memory access time, h 
is the hit ratio, and t_cache and t_main are the cache and main memory access times, on 
the assumption that the hardware first looks in the cache and then in main memory if 
there is a cache miss. Using this relationship, show that the value of 0.3 t_main can be 
calculated for a cache that is five times faster than main memory with a hit ratio of 
90%. How would things change if the hit ratio were 95%? 


11.3 Most PCs have two levels of cache. One is part of the processor and the second is 
external to it. Suppose that the relative performance of each level is 1 for the internal 
cache, 3 for the external cache, and 7 for the main memory. That is, if the internal 
cache operates at 10 ns, the second level would operate at 30 ns, and main memory 
would operate at 70 ns. Assuming a hit ratio of 90% for the first-level cache and a 
95% hit ratio for the second-level cache, what is the effective access time for 
memory references? Generalize the formula given in exercise 11.2 to apply to a two-level 
cache as just described. 


11.4 How many internal self-calls of FIB1 are required to compute the value of the 6th 
member of the sequence of Fibonacci numbers? 


11.5 Modify the FIB1 function by allocating an information unit in the data section 
which can be incremented every time that FIB1 is entered. How many times is FIB1 
called in the course of computing F_6? F_35? If you are in the OpenVMS programming 
environment, do you need to add instructions to be sure there is a base register for 
addressing into the data section? 


11.6 Show that the total number of passages through the FIB1 function for computing the 
nth member in the sequence of Fibonacci numbers will equal the sum of F_n - 1 
passages which fall through the blbs branch instruction (i.e., full passages) and F_n 
passages which take the blbs instruction (i.e., abbreviated passages returning 1 as 
the value). 


11.7 If some of the instructions in FIB1 were altered or rearranged, would the routine 
then need only one local variable? Discuss the advantages and disadvantages of such 
changes. 


11.8 The FIB1 function uses 4 quadwords of stack per re-entry instance. Could it use 
fewer? Evaluate any advantages or disadvantages. 


11.9 Use FIB3 alone with a modified calling program to find out the point in the 
sequence of Fibonacci numbers where the computed value can no longer be 
represented as an unsigned 64-bit value. Explain your method. 


11.10 Are there opportunities in FIB2 for instruction reordering? Make those changes 
manually and measure the relative improvement in performance. 


11.11 Why are latency rules associated with an implementation, not an architecture? 


11.12 (Individual or group project) Develop and thoroughly analyze Alpha functions for 
the factorial function, both based on the well-known recurrence relation 


F_n = n * F_(n-1) 


and based on a simple non-recursive program loop. 


CHAPTER 12 


Looking at Output from Compilers 


The last chapter began a discussion about program 
optimization from the vantage point of an assembly language programmer, including 
the capabilities for automated optimization which are offered by the assemblers for the 
Alpha. As we know, however, people write programs only rarely in assembly language. 


This companion chapter continues the discussion about program optimization 
from the vantage point of a high-level language programmer, including the capabilities 
for automated optimization which are offered by the compilers for the Alpha. Our treat- 
ment will again take the form of closely examining the compiled output from quite sim- 
ple programs. 


For such simple programs, the full power of modern optimizing compilers scarcely 
comes into play, yet we feel that challenging our readers to imagine themselves compet- 
ing against a compiler in the context of a few transparent examples will reward them for 
having studied Alpha architecture and assembly language. Seeing the choices that can be 
made, all of which are logically equivalent to the programmer’s intent, will again be illu- 
minating as in the last chapter. At the very least, our readers will learn how to get a 
glimpse of what happens “underneath” a compiled high-level language when a program 
is finally expressed in the machine language of the computer on which it is to execute. 


While the material covered in this chapter goes well beyond what the novice may 
need to comprehend the Alpha architecture in particular, the insights gained here are 
intended to convey an appreciation that there are significant differences between vari- 
ous languages and compilers that warrant a careful consideration of their options and 
capabilities. This material may seem difficult, and we know it is seldom found in 
architecture or programming books, but we feel it fills out the general picture of RISC 
architecture that we have hoped to evoke. 

When one writes a program in a high-level language, the compiled program may 
be efficient or not depending perhaps as much on the adequacy and sophistication of the 
compiler as on the programmer’s choice of algorithm. Moreover, the successful intro- 
duction of an entirely new architecture may depend as much on good compilers as on 
good hardware concepts, especially for a RISC design. Ultimately, some purchasing 
decisions are made on the basis of overall perceived performance in executing the 
buyer’s application programs on the target hardware implementation. 


Why RISC Systems Need Good Compilers 


The very earliest computers, such as the von Neumann machine, bore some resem- 
blance to what are now called RISC architectures. The von Neumann machine even 
fetched two instructions at once, though it did not have sufficient internal parallelism 
for true dual issue. As both computers and software such as compilers evolved, how- 
ever, most systems developed in ways which came to be called CISC systems. The IBM 
801 minicomputer (dating from 1975) is generally credited with being the first actual 
RISC architecture, although some earlier machines such as the CDC 6600 had RISC- 
like features. The main examples of RISC systems began to emerge from the research 
community after 1980, driven in part by difficulties in compiler design posed by the 
sheer complexity of instruction sets and the challenge of keeping pipelines for ever- 
faster CISC implementations from having too many bubbles. 

Compiler technology had already advanced in its theoretical foundations and its 
commercial software engineering by the time the first RISC systems were proposed or 
designed. It is as though the software and hardware realms were both synchronously 
poised to advance to a new sort of computer system. A case might be made that RISC 
ventures could have failed, absent advances in compilers which made their pipelines 
perform adequately in spite of the timing problems with slower load/store instructions 
versus faster register-to-register instructions. Otherwise, it has been argued, the rela- 
tively greater “power” of CISC instructions combined with some pipelining possibili- 
ties would have continued to hold sway since more of the simpler RISC instructions are 
needed to solve comparable application problems. 

MIPS, in particular, became known for its compiler technology as well as for its 
processor architectures. MIPS pursued an approach to compiler systems involving lan- 
guage-specific “front ends” that convert programs into one common intermediate 
encoded form. A common “back end” analyzes and optimizes that intermediate expres- 
sion of the program and then generates actual machine instructions. A compiler system 
composed of such front and back ends can be modified effectively, on the one hand as 
language standards change or as another language is supported, and on the other hand 
as new hardware implementations require different optimizations. 

Digital Equipment Corporation developed its well-respected GEM compiler tech- 
nology at a time when its line of VAX systems (CISC) was complemented by a line of 
MIPS-based workstations and servers (RISC), and the Alpha architecture was soon 
coming. This GEM technology made it possible to offer compatible language compilers 
for both VAX and Alpha systems, thus facilitating migration of customer applications 
from the 32-bit era to 64-bit systems, especially those for the OpenVMS programming 
environment. 

The GEM compiler design, like the MIPS system, includes the concept of a com- 
pact, universal intermediate representation. In the GEM system, a universal optimizer 
independent of any particular programming language or hardware considerations oper- 
ates upon the intermediate code. Other preliminary optimizations occur through the 
operation of the appropriate language-specific front end. Specific requirements of a par- 
ticular operating system and the target hardware are accommodated at the back end 
when machine instructions, data storage, and linkage pointers are formulated. 

Compilers sometimes provide control over the types of optimizations that they can 
perform. Those optimizations may include not only generally applicable techniques 
such as unrolling loops, but also the deliberate use of special instructions that are 
implemented in hardware on some systems or in software on others. The dilemma in the case 
of commercial software is whether to distribute a “one size fits all” version, or many 
versions, or one version optimized for a particular implementation (i.e., model) of a 
computer system. The GEM compiler system provides for such implementation-based 
tuning. 
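As an illustration of loop unrolling, one of the generally applicable techniques just mentioned, the following C sketch shows the same computation written as a plain loop and as a loop unrolled by hand. The function names and the unrolling factor of five are our own choices for the example, not a statement about what any particular compiler does.

```c
/* Straightforward loop: one add, one multiply, and one loop test per element. */
void scale_rolled(double *a, const double *b, double c, double k, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = c * (b[i] + k);
}

/* The same computation unrolled by five: one loop test per five elements.
   Assumption: n is a multiple of 5. */
void scale_unrolled5(double *a, const double *b, double c, double k, int n)
{
    for (int i = 0; i < n; i += 5) {
        a[i]     = c * (b[i]     + k);
        a[i + 1] = c * (b[i + 1] + k);
        a[i + 2] = c * (b[i + 2] + k);
        a[i + 3] = c * (b[i + 3] + k);
        a[i + 4] = c * (b[i + 4] + k);
    }
}
```

The unrolled form executes fewer branches and gives an instruction scheduler five independent add-multiply chains to interleave, at the cost of a larger program.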


Compiling a Simple Program 


Comparing the output from the compilers for several high-level languages can usefully 
reveal both similarities and differences. The extremely rudimentary COM_x program 
shown in Figure 12.1 is written in a fashion as parallel from one language to another as 
the syntax rules of FORTRAN, Pascal, and C will allow. We knowingly ignore the ini- 
tialization issue with regard to array B since filling that array’s elements with dummy 
values would require more lines in the program without adding benefits to the main 
points we want to discuss. 
      PROGRAM COM_F 
      DOUBLE PRECISION A(12), B(12), C 
      INTEGER*4 I 
      C=2.71828 
      DO I=1,10 
        A(I) = C*( B(I) + 3.14159 ) 
      END DO 
      END 


PROGRAM COM_P; 
VAR 
   a, b : ARRAY[1..12] of double; 
   c : double; 
   i : integer; 
BEGIN 
   c := 2.71828; 
   FOR i := 1 TO 10 DO 
      a[i] := c*( b[i] + 3.14159 ); 
END. 


main () 
{ 
   double a[12], b[12], c; 
   int i; 
   c=2.71828; 
   for (i=1; i<11; i++) 
      a[i] = c*( b[i] + 3.14159 ); 
} 

Figure 12.1 COM_F, COM_P, and COM_C: Simple program to be compiled 


The COM_x program contains floating-point variables and constants in addition 
to integer quantities. In order to understand this current material, if you skipped over 
Chapter 8, you need to know that Alpha instructions such as a load or a multiply end 
with s or t for single- or double-precision floating-point data, respectively. Also recall 
from Chapter 11 that a typical Alpha implementation contains more than one execution 
unit in the CPU, i.e., that certain floating-point manipulations may be carried out simul- 
taneously with certain integer manipulations. 

We compiled the variants of COM_x in three languages on an OpenVMS system, 
as follows: 
$ fortran/list/mach/float=ieee/noopt com_f 
$ pascal/list/mach/check=nobounds/float=ieee/noopt com_p 


a, b : ARRAY[1..12] of double; 


%PASCAL-W-UNWRITTEN, Variable B is read, but never assigned 
into at line number 3 in file COM_P.PAS;1 

%PASCAL-W-ENDDIAGS, PASCAL completed with 1 diagnostic 

$ cc/list/mach/float=ieee/noopt com_c 


Since the system default is /optimize (specifically, level=4) for each of these 
languages, we explicitly specified /nooptimize along with /machine_code and 
/list in order to obtain a listing file containing a representation equivalent to the 
generated machine instructions. Since the OpenVMS system default is to use VAX- 
compatible representations for floating-point quantities, we explicitly specified IEEE 
representations and the corresponding Alpha instructions to be generated. 

The Pascal and FORTRAN (but not C) compilers can produce programs that con- 
tain extra instructions for dynamic bounds checking for arrays, which would detect any 
attempt to access an undefined A(13) or a[13] element in the COM_x program. 
Since bounds checking is the default for Pascal (but not for FORTRAN), we explicitly 
specified /check=nobounds in order to make COM_P more comparable to the 
other variants. 
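The effect of dynamic bounds checking can be sketched in C, which does not provide it. In this illustration the function name and the error convention (returning -1) are hypothetical; a real run-time system would signal an error condition instead.

```c
/* Store value into the element with subscript index of an array whose
   declared subscripts run from lo through hi. Returns 0 on success,
   or -1 when the subscript is out of bounds and nothing is stored. */
int checked_store(double *array, int lo, int hi, int index, double value)
{
    if (index < lo || index > hi)
        return -1;                 /* e.g., an attempt to store into a[13] */
    array[index - lo] = value;
    return 0;
}
```

The comparison of the subscript against both bounds is the kind of extra work that bounds checking adds to every array reference, which is why we disabled it for the comparison.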

Note that the syntactical analysis carried out by the front end of the Pascal com- 
piler resulted in a warning that this oversimplified program might be pointless. The pro- 
gram computes values for a vector named a based upon a vector named b whose 
elements have unknown values. 

Table 12.1 organizes the machine instructions for the three language variants of 
COM_x into parallel columns, with analogous instructions or instruction sequences 
aligned in the same or nearby rows. Some rows have < indicators to draw your attention 
to specific ways the compiled programs handle variables (A, a; B, b; I, i) and 
constants (π; C, c). 
Table 12.1 Compiler Output for COM_x Programs (OpenVMS; no optimization) 
FORTRAN Pascal C 
COM_F:: COM_P:: MAIN:: 
LDA SP, -32(SP) LDA SP, -224(SP) LDA SP, -208(SP) 
STQ R27, (SP) 
STQ R26, 8(SP) 
STQ R2, 16(SP) 
STQ FP, 24(SP) MOV FP, R17 
MOV SP, FP MOV R27, FP 
MOV R27, R2 
LDA R16, 64(R2) 
MOV 1, R25 
LDQ R26, 48(R2) 
LDQ R27, 56(R2) 
JSR R26, 
DFOR$SET_REENTRANCY 
LDT F0, 80(R2) LDS F0, 28(FP) LDT c, 32(FP) 
LDQ R0, 32(R2) 
STT F0, C STT F0, C 
MOV 1, R1 MOV 1, R1 MOV 1, i 
LDQ R16, 32(R2) STL R1, I 
STL R1, I BR L$2 BR L$2 
LS1: LS5: 
LDL R16, I 
ADDL R16, 1, R16 
STL R16, I 
L$2: 
LDQ R17, 32(R2) 
LDL R17, I LDL R0, I 
MULQ R17, 8, R17 MULL R0, 8, R0 MULL i, 8, R16 
LDQ R18, 32(R2) 
LDT F1, C LDT F0, C 
LDQ R19, 32(R2) 
LDL R19, I LDL R16, I 
MULQ R19, 8, R19 MULL R16, 8, R16 MULL i, 8, R0 
LDQ R28, 32(R2) 
ADDQ R28, R19, R19 
LDT F10, 104(R19) <B 
LDT PIL, TARZ) <T 
ADDT F10, F11, F10 
MULT F1, F10, F1 
LDQ R28, 32 (R2) 
ADDQ R28, R17, R17 
STT F1, 8(R17) <A 
LDQ R20, 32 (R2) 
LDL R20, I 
ADDL R20, 1, R20 
LDQ R21, 32(R2) 
STL R20, I 
LDQ R22, 32 (R2) 
LDL R22, I 
CMPLE R22, 10, R22 
BNE R22, L$4 
MOV 1, R0 
MOV FP, SP 
LDQ R26, 8(FP) 
LDQ R2, 16(FP) 
LDQ FP, 24(FP) 
LDA SP, 32(SP) 
RET R26 


SP; 
ey 
F10, 
Fl, 
FO, 


SP, 


SP 
R26 


R16, 
16 (R16) 
24 (FP) 
F10, Fl 
Fly FO 


RO, RO 
t12 (RO) 


224 (SP) 





ADDQ 
LDT 
LDT 
ADDT 
MULT 


ADDQ 


FER 


ADDL 


L$2: 


CMPLT 


BNE 


MOV 
MOV 


LDA 
RET 
R16 


C 
Sr, RO. RO 
F1, 8(RO) <b 
F10, 24(FP) <N 
El, FLO, Fi 
e Fin Fi 
SP, R16, Ri6 
F1, 104(R16) <a 
Le Tp OL 
Eye dan RO 
RO, L$6 
t; RO 
R17, FP 
SP, 208 (SP) 
R26 


The three source programs were written using very simple syntactical forms for 
each language, and in such a way as to make the high-level language expressions 
resemble one another as closely as possible. 

The data storage allocations made by the compilers for the three languages are dif- 
ferent. FORTRAN puts the declared variables as well as the fixed constants into data 
sections that would be analogous to a “global” data storage scheme for Pascal or C. For 
the style of program layout in Figure 12.1, the latter two languages instead allocate 
space on the stack for the declared variables of a main program as well as of a callable 
procedure. An array in FORTRAN has default subscripts running from 1 up through the 
declared upper bound. An array in Pascal is declared with explicit lower and upper 
index bounds. An array in C has a lower index bound of zero. Thus when the COM_C 
program declares a[12] there are 12 elements, a[0] through a[11], but the loop in 
the program only accesses 10 of these, a[1] through a[10]. 
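The different lower bounds change where a given subscript lands in storage. The C sketch below computes the byte offset of an element from the start of the array, assuming 8-byte (T_floating) elements; the function name and the constant are ours.

```c
#include <stddef.h>

enum { ELEMENT_BYTES = 8 };   /* assumed size of one double-precision element */

/* Byte offset, from the start of the array, of the element with
   subscript index when the first element has subscript lower. */
size_t element_offset(int index, int lower)
{
    return (size_t)(index - lower) * ELEMENT_BYTES;
}
```

Thus A(1) in FORTRAN and a[1] in Pascal lie at offset 0, while a[1] in C lies at offset 8 because a[0] occupies the first quadword. The multiplication by 8 corresponds to the MULQ and MULL instructions with the literal 8 in the unoptimized listings.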

Similarities among the three compiled programs would be expected, especially if 
the GEM compilers share a common intermediate language and other techniques. 
Moreover, the calling convention for the OpenVMS programming environment should 
produce certain regularities. We draw your attention to the following points of similarity: 


• Each prologue establishes a frame pointer (FP), although FORTRAN does not use 
  it for subsequent addressing in this program because the variables have been 
  stored differently than for Pascal or C (see below). 

• All three languages distinguish between the three uses of I, namely, as the loop 
  control, as the subscript for B, and as the subscript for A because in a more general 
  program these might all involve different variables or perhaps expressions. 

• The instruction sequence for the arithmetic expression and assignment is quite 
  analogous, even though there are differences in addressing technique and register 
  assignments. 

• Each epilogue restores the stack and sets the value one in register R0 to indicate 
  success upon exit in the OpenVMS programming environment. 


and to these additional points of difference: 


• The storage declarations for FORTRAN appear similar to those for Pascal and C in 
  the source programs, but actually declare variables in a data section (FORTRAN) 
  instead of local variables on the stack (Pascal and C). 

• Loop control instructions are positioned differently. Note the relative positions of 
  the ADDL instruction for the loop index and the CMP instruction that performs the 
  exit test for the loop in each language variant of the program. 

• The compiled C program holds the variables c and i in assigned registers, whose 
  particular register numbers we could find using the debugger if we needed to. The 
  FORTRAN and Pascal programs keep referring back to memory locations for C 
  and I in these unoptimized versions. 

• Every time it refers to any of the variables I, C, A(), B(), FORTRAN keeps 
  reloading a different base register from 32(R2), where register R2 holds a copy 
  of the procedure value (register R27). 

• Pascal has assumed IEEE S_floating for the two constant values, while FORTRAN 
  and C have assumed T_floating for them. 


We must emphasize that some of these observations of similarities and differences are 
artifacts of the simplistic nature of this particular program, and should in no way be 
construed as flaws in any high-level language, any corresponding compiler, or the 
Alpha architecture. 

Clearly there are things in each of these programs that a human programmer 
would probably do differently in direct coding in Alpha assembly language. These 
might include: 


• Load the constant 3.14159 (indicated by <π in Table 12.1) into a floating-point 
  register only once, outside the loop. 

• Use only one integer register for I, instead of as many as three. 

• Recognize and avoid the repetitive recourse to a single memory location, as with 
  32(R2) in the FORTRAN variant. 

• Schedule (i.e., reorder) the instructions to minimize data stalls. 


Remember, however, that these languages permit complicated constructs and in some 
instances function calls that may have side effects, where in contrast COM_x has only a 
simple scalar variable or a constant. The compilers are engineered to produce correct, 
efficient programs for the general case. Also note that these are unoptimized programs, 
but that the default setting for the compilers is to perform optimization at level 4 (out of 
a possible 5). 


Optimizing a Simple Program 


As with the non-optimized output, comparing the optimized output from the compilers 
for several high-level languages can usefully reveal both similarities and differences. 

We compiled the variants of COM_x in three languages on an OpenVMS system, 
as follows: 


$ fortran/list=com_f_opt/mach/float=ieee com_f 
$ pascal/list=com_p_opt/mach/check=nobounds/float=ieee com_p 


a, b : ARRAY[1..12] of double; 
%PASCAL-W-UNWRITTEN, Variable B is read, but never assigned 
into at line number 3 in file COM_P.PAS;1 
%PASCAL-W-ENDDIAGS, PASCAL completed with 1 diagnostic 
$ cc/list=com_c_opt/mach/float=ieee com_c 
The system default is /optimize (specifically, level=4) for each of these 
languages. We specified /machine_code and /list in order to obtain a listing file 
containing a representation equivalent to the generated machine instructions. Since the 
OpenVMS system default is to use VAX-compatible representations for floating-point 
quantities, we explicitly specified IEEE representations and the corresponding Alpha 
instructions to be generated. 

Note that the syntactical analysis carried out by the front end of the Pascal com- 
piler resulted in a warning that this oversimplified program might be pointless. The pro- 
gram computes values for a vector named a based upon a vector named b whose 
elements have unknown values. 

Table 12.2 organizes the optimized machine instructions for the three language 
variants of COM_x into parallel columns, with analogous instructions or instruction 
sequences aligned in the same or nearby rows. Some rows have < indicators to mark 
where the compiled programs handle certain variables (A, a; B, b) and constants (π). 


Table 12.2 Compiler Output for COM_x Programs (OpenVMS; level 4 optimization) 


FORTRAN Pascal C 
COM Bs ¢ 
LDA SP, -32 (SP) FO, B <b 
LDA R16, 48(R27) Fly 20(R27) <N 
STO R26, 8(SP) CG, 1L6(R27) 
MOV Ly R29 SP, =208 (SP) 
LDQ R26, 64(R27) FO, Fl, FO 
STO R27, (SP) Er A 
STQ R2, 16(SP) R16, A 
STQ FP, 24(SP) FO, Cy, FO 
MOV SP, FP FO, A 


MOV R27, R2 
LDQ R27, 12 (R27) 


JSR R26, 
DFORSSET_REENTRANCY 


LDQ RO, 40(R2) 

LDT Fi, 32(R2) <T 

LDT C, 80(R2) 

LDA R1, 72(RO) 

Diels LS3: 

LDL R31, 144(RO) LDL R31, 104(R16) 
LDL R31, 176(RO) 
LDT F10, =96(R0) «B 

LDT Pil, =88 (R0) <B Filly -889 (R16) <b 
LDT F12; -80 (R0) <B F12, -80(R16) <b 
LDT Fis, =72(B0) <B F13; -72(RL6) <b 
LDT F14, -64(RO) <B F14, -64 (R16) <b 





ADDT F10, Pi, F10 
ADDT F11, F1, F11 
LDA RO, 40(R0) 
ADDT F12, F1, F12 
MULT F10, C, F10 
ADDT F13, F1, F13 







Pid, Fiy ELI 







F12; Fly F12 
Fis, Fi, Pils 
Ep 4y 2 

R16, 32(R16) 
F14, Fi, Fid 
ELl; Gy FLI 
Ly Zy eo 
dO 

FLZ, 























MULT Pid, €, FLI 
CMPULE RO, R1, R17 
ADDT F14, Fl, F14 
MULT Fl2; C; Fiz 

STT F10, -40(RO) <A 
MULT Fis, Cr F13 

STT F11, -32(R0) <A 
MULT Flá, C, F14 

STT F12, -24(RO) <A 
STT F13, —16(R0) <A 
STT F14, -8(RO) <A 
BNE RII Bed 

















Cy Bae 








Pio, 
F14, 
Pils 
Fiz, 
F13, 
F14, 
R20, 


Ce EL 
G; Fla 
-24 (R16) 
<LO (RLG) 
= (R16) 
(R16) 
L$3 






























MOV FP, SP F15, -88(R16) 

MOV 1, RO F15, F1, Fi MOV iy RO 
LDQ R26, 8(FP) Fl; G Fi 

LDQ R2, 16(FP) F1, 8(R16) 

LDQ FP, 24(FP) 

LDA SP, 32(SP) LDA SP, 208(SP) 

RET R26 RET R26 RET R26 





The outcome of optimization (at the level=4 default for OpenVMS) differs among 
the three languages, both from one language to another (Table 12.2) and in how the 
optimized program for a given language (Table 12.2) departs from its unoptimized 
counterpart (Table 12.1). The optimization of each language variant of COM_x will 
now be discussed separately. 


FORTRAN 


The FORTRAN compiler for OpenVMS has revolutionized the organization of the 
COM_F program through loop unrolling and other optimization techniques. Instead of 
the counted loop that the programmer expressed, running from 1 to 10 for the index I, 
the optimized program has a longer loop body with instructions replicated five-fold. 
This modified loop body is set to be executed only twice. The FORTRAN compiler has 
also moved both fixed constants into registers before the loop is entered. 


In this optimization, we see RISC architectural factors being accommodated. Sev- 
eral floating-point registers are used for the replicated functions of loading, adding, 
multiplying, and storing array elements, thus lessening data stalls in the pipeline. The 
compiler takes responsibility for generating all of the appropriately modified displace- 
ments for addressing purposes. The entire course of execution of the program now 
involves only one instance of branching back, instead of ten instances for the unopti- 
mized compilation result. 


The three uses of I and the cumbersome addressing methods in Table 12.1 have 
been transformed almost beyond recognition in Table 12.2. Now a single register R0 is 
used for base addressing for both array B and array A. Note the LDA instruction that 
advances by 5*8 = 40 bytes after the use of register R0 for addressing elements of B but 
before the register is used for addressing elements of A. The loop exit test is now based 
on an unsigned comparison of the address in register R0 against a reference address 
held in register R1. 


All in all, the index variable I has vanished from any physical representation of a 
counting sequence from 1 to 10 when this program runs, yet the transformed program is 
still logically equivalent to the programmer’s intent. 


Finally, the FORTRAN compiler appears to have improved instruction scheduling 
by interspersing the few available integer instructions amongst the floating-point 
instructions and also by starting the store instructions in a deliberate way. 
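
The reorganization described above can be sketched at source level in C. This is only an illustration under stated assumptions, not the compiler's output: the expression (b[i] + t) * c is inferred from the load, add, multiply, and store pattern in Table 12.2; the function names and the parameters t and c are ours; and we use two pointers where the compiled code uses a single base register for both arrays.

```c
#include <assert.h>

#define COUNT  10   /* trip count of the original loop (1 to 10 in the text) */
#define UNROLL 5    /* replication factor described for the FORTRAN compiler */

/* The loop as the programmer expressed it: one element per pass. */
void com_rolled(double *a, const double *b, double t, double c)
{
    for (int i = 0; i < COUNT; i++)
        a[i] = (b[i] + t) * c;
}

/* Hand-unrolled equivalent: five elements per pass, two passes in all. */
void com_unrolled(double *a, const double *b, double t, double c)
{
    const double *limit = b + COUNT;   /* reference address for the exit test */

    assert(COUNT % UNROLL == 0);       /* otherwise a cleanup loop is needed */
    do {
        /* Five independent loads and adds, each in its own temporary. */
        double x0 = b[0] + t, x1 = b[1] + t, x2 = b[2] + t,
               x3 = b[3] + t, x4 = b[4] + t;

        b += UNROLL;                   /* like the LDA that advances by 5*8 bytes */

        a[0] = x0 * c;  a[1] = x1 * c;  a[2] = x2 * c;
        a[3] = x3 * c;  a[4] = x4 * c;
        a += UNROLL;
    } while (b < limit);               /* an address comparison replaces the index test */
}
```

Calling both functions on the same input vector should leave identical results in the two output vectors. Note that no index variable survives in com_unrolled; only the advancing addresses remain, which is the point made in the text.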


Pascal 


The Pascal compiler for OpenVMS has reorganized the COM_P program through loop 
unrolling and other optimization techniques. 

The counted loop has been replaced in an interesting way. The case i=1 appears 
in the prologue and the case i=10 in the epilogue. The new loop in the central portion 
of the optimized program, where instructions are replicated four-fold, will run only 
twice. The Pascal compiler has also moved both fixed constants into registers before the 
loop is entered. 

The programmer’s loop index i has been replaced by the compiler’s new control 
variable I in a register that has the value 1 for the first pass down through the loop body 
and 5 for the second pass. This new index is compared against the somewhat surprising 
value 7 as the exit test for the loop. 

Subscripting with the programmer’s i has been replaced by addressing with 
compiler-generated displacements. Register R16 is used as the base register for addressing 
elements in both floating-point arrays, as the Pascal column of Table 12.2 shows. 

Like the FORTRAN compiler, the Pascal compiler has moved instructions around 
to improve scheduling and pipeline performance. Note, for example, the distant separa- 
tion of the CMPLT and BNE instructions, as well as the early positioning of the MOV 
pseudo-instruction that sets one as an indicator of a successful exit to the OpenVMS 
operating environment. 
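
The shape of the Pascal result can likewise be sketched in C. This is a hypothetical rendering, not the compiler's output: the function names, the parameters t and c, and the zero-based subscripts are ours, and we assume that the control variable is advanced by 4 before it is compared against 7.

```c
#include <assert.h>

#define COUNT 10    /* trip count of the original loop (i = 1 to 10) */

/* Reference form: the loop as written by the programmer. */
void com_reference(double *a, const double *b, double t, double c)
{
    for (int i = 0; i < COUNT; i++)
        a[i] = (b[i] + t) * c;
}

/* Peeled and unrolled form: first and last cases outside a four-fold loop. */
void com_peeled(double *a, const double *b, double t, double c)
{
    int k = 1;                              /* new control variable: 1, then 5 */

    a[0] = (b[0] + t) * c;                  /* prologue: the case i = 1 */
    do {
        a[k]     = (b[k]     + t) * c;
        a[k + 1] = (b[k + 1] + t) * c;
        a[k + 2] = (b[k + 2] + t) * c;
        a[k + 3] = (b[k + 3] + t) * c;
        k += 4;
    } while (k < 7);                        /* 5 < 7 repeats; 9 < 7 exits */
    assert(k == COUNT - 1);                 /* exactly one case remains */
    a[k] = (b[k] + t) * c;                  /* epilogue: the case i = 10 */
}
```

Under the arrangement assumed here, any limit from 6 through 9 would yield the same two passes, which may be one reason the value 7 looks surprising at first sight.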


C 


The C compiler for OpenVMS quite startlingly optimizes the COM_C program to the 
vanishing point, leaving only the MOV pseudo-instruction that sets one as an indicator of 
a successful exit, followed by the RET instruction. 

Compilers do vary in their thoroughness and in ways that may depend on circum- 
stance. Nothing in the standards for the C language requires that useless operations be 
performed. The COM_C program performs no output; therefore, the compiler is not 
obliged to calculate values which would never be seen. 

We particularly like this latter demonstration. Having seen the body of a program 
disappear, can you now imagine a situation where a misguided programmer could be 
trying to modify a region of C source code that the compiler is going to ignore, perhaps 
caused by a logic error somewhere else that the programmer has not yet found? 
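
A minimal C sketch of the distinction follows. It is not the COM_C source: the function names, the constants 2.0 and 3.0, and the zero-initialized vector b are assumptions made for illustration. The first function computes values that nothing ever observes, so an optimizer is entitled to remove the loop; the second returns a value derived from the results, so their values must be preserved.

```c
#include <assert.h>

#define SIZE  12    /* array length, as in the Pascal declaration shown earlier */
#define COUNT 10    /* number of elements actually computed */

/* Nothing computed here is ever observed, so an optimizer may delete the
   loop and both arrays, keeping only the return of the status value. */
int com_silent(void)
{
    double a[SIZE], b[SIZE] = { 0.0 };

    assert(COUNT <= SIZE);              /* the loop must stay inside the arrays */
    for (int i = 0; i < COUNT; i++)
        a[i] = (b[i] + 2.0) * 3.0;
    (void)a;                            /* names a without creating an observable use */
    return 1;
}

/* Returning a value derived from a[] makes the results observable, so the
   compiler must preserve their values (though it may precompute them). */
double com_observed(void)
{
    double a[SIZE], b[SIZE] = { 0.0 };
    double sum = 0.0;

    assert(COUNT <= SIZE);
    for (int i = 0; i < COUNT; i++) {
        a[i] = (b[i] + 2.0) * 3.0;
        sum += a[i];
    }
    return sum;
}
```

Whether a particular compiler actually removes the loop in com_silent depends on the compiler and the optimization level, as the comparison of the OpenVMS and Unix compilers in this chapter suggests.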


Implementation-Dependent Optimizations 


We also tested several levels of optimization of the COM_C program using the C com- 
piler supplied for Digital Unix, and the body of the program did not virtually vanish as 
it did with the OpenVMS compiler. Instead, the result was more comparable to the 
FORTRAN and Pascal columns in Table 12.2 for OpenVMS compilers. 

We compiled the COM_C program on a Unix system specifying several different 
optimizations, as follows: 


$ cc -source_listing -machine_code -O0 com_c.c 
$ cc -source_listing -machine_code -O3 com_c.c 
$ cc -source_listing -machine_code -O3 -tune ev5 com_c.c 


We specified -source_listing and -machine_code in order to obtain a listing 
file containing a representation equivalent to the generated machine instructions. The 
system default is -O3 and -tune generic. The -tune flag can specify instruction 
reordering that is optimal for various implementations of the Alpha architecture. For the 
COM_C program, the generic tuning is acceptable for any implementation (including 
those based on the 21064 chip), while ev5 tuning is better for the more advanced 
21164 chip. We take up some of the differences among the successive generations of 
Alpha chips in more detail in Chapter 13. 

Each column of Table 12.3 contains the extracted opcode and operand fields from 
the listing file produced by one of the three selected optimization specifications. 
Analogous instructions or instruction sequences are aligned in the same or nearby rows and 
are marked with < notations at several locations for emphasis. 


Table 12.3 Compiler Output for COM_C Program (Unix) 


-O0 -O3 -O3 -tune ev5 


main: : main: : main:: 

ldah gp, main ldah gp, main ldah gp, main 

lda gp, main lda gp, main lda gp, main 

lda sp, -208(sp) lda sp, -208(sp) lda sp, -208(sp) 

stq vola (sp) 

ldq E28, (dp) ldg r28, top) ldq r28, (gp) 
stq r15, 200(sp) stq r15, 200(sp) 
mov Sp, rid mov sp, r15 
stq r26, 1982(sp) <AZ 1106 fl, (S28) <A3 
lda tl; 8(sp) <A2 | ldt C; 8 (r28) <A3 
ldt £1, (#26) <A2 |stq r26, 192(sp) <A3 
lda r2, 88(sp) <A2 | lda cl, SSD) <A3 

1ldt EO, S(e23) rdt G, 8 (E28) <A2 | lda r2, 88(sp) <A3 

stt f9, ¢ <8 (sp) |nop nop 

mov L; re 

stl Tey 2 <0 (sp) 

Liss s LS3: L$3: 


ldl ro, 43 
addq rO, -11, roO 
bge TO Ss 


ŘS 
L$4: 
LAL 
mulq 
TOE 
LaL 
mulq 
addq 
ior 


ldq 
ldt 
addt 


mult 


addq 
stt 


1dl 
addl 
stl 
br 


LS5: 
mov 
lda 
ret 





50 a 
oi; & 
Ekr Bip wi 
O, B 
ra, i 
SS, By BS 
20, TA, r3 
ri, 26(P3) EL» Cel) ldt 
ro 
fii, &(#1) ldt 
r28; (Gp) F12, Leer) lat 
E10, (E28) £13, Zetri) ldt 
Ely Bie, £L ELG, El, LO 
Eid, 32 (r1) ldt 
Cii; EL, ELL 
tity EL, Bie 
ri, 4240 (01) <B2 | 1da 
fl Gaby i Lae 
El, x2, ra cmpult 
cir 
addt 
addt 
addt 
addt 
addt 
EO, £L; £0 mult 
mult 
mult 
mult 
wp; Tip xi mult 
EO, i112 Gel) Ste 
stt 
Hg. í stt 
£0, ly ZO StU 
cO, T stt 
Les bne 
mov 
T26 192 (r15) ldq 
31, ZO £15, 200 (sp) ldg 
sp, 208(sp) sp, 208(sp) lda 
E26 R26 ret 





-O3 -tune ev5 
ELO, CELS 
Eli; S (81) 
ELZ, 16 (21) 
£13, 24({(r1j 
Ela., 32 (213 
Fi, 40(r1) 
ti; £2, pa 
ro 

LLU: £i; £10 
Eiis Sl; ELL 
EL Fix £12 
Clay Dep Bis 
rid, £1, €Lé 
ELO: Gy £10 
ELl; © ELS 
Los, €, £12 
Eis, Gy £L3 
Ela, cy, F114 
£10, 56(£1) 
f11, 64(4r1) 
PAZ, vatrl) 
£13, 80(4r1) 
Ela, 88(4r1) 
ra, LS 

ris» Sp 

t26, 192 (715) 
ri5, 200 (sp) 
sp, 208(sp) 
R26 

The unoptimized program for Unix in the first column of Table 12.3 somewhat 
resembles the unoptimized COM_C program as compiled for OpenVMS in the third 
column of Table 12.1. During each traversal of the loop, the index i is updated in 
memory and separate base registers for addressing the two arrays are established. The loop 
test at L$3 also involves repetitive reloading of i from memory because the test is 
destructive of the value in register R0. The additive floating-point constant is reloaded 
from memory every time. 

The optimized COM_C program for Unix in the second and third columns of 
Table 12.3 resembles the optimized COM_F program in the first column of Table 12.2. 
The loop is unrolled five-fold, and both floating-point constants are put into registers 
before the loop is entered. Register R1 serves as the base register for addressing the two 
arrays. The test for terminating the loop compares the advancement of that base register 
against a limiting value established in register R2. 


Tuning for an Implementation 


The Alpha compilers provide the ability to tune a program at compile time to suit one 
particular implementation using commands of the form: 


> cc -tune VAR com_c.c (Unix) 
$ cc/optimize=(tune=VAR) com_c (OpenVMS) 


where the option keyword VAR may be one of the keywords listed in Table 12.4, and 
where the appropriate system command for any of the other languages with GEM com- 
pilers may be substituted for cc. 

The general outcome of such tuning may be inferred by comparing the second 
column in Table 12.3 (generic) and the third column (ev5). At <A2 the two lda 
instructions that require the CPU only are scheduled between the stq and ldt 
instructions that require accesses through the data cache, while at <A3 the three 
memory-referencing instructions precede the two lda calculations. Similarly, the 
instructions marked <B2 and the cmpult instruction are interspersed among the 
floating-point load and add instructions in the second column, while the instructions marked 
<B3 and the cmpult instruction are clustered in the third column. 

The generic tuning (second column) thus ensures that the few integer operate 
instructions can be dual-issued with load or floating-point instructions, and that there 
will not be a data stall (involving register R1) at the cmpult instruction. Those are 
more serious concerns with the earlier implementations of the Alpha architecture. Newer 
implementations can better conduct issue-related optimizations dynamically, because 
the CPU has more internal pathways and functional units. 

Table 12.4 Option Keywords to Identify Implementations to an Alpha Compiler 

Keyword   Implementation(s) Specified 

generic   Generate instructions and select instruction tuning appropriate for all Alpha 
          processors (the default option). The details underlying this type of optimization 
          may change from time to time, in order to reflect the most prevalent chip 
          implementation in customers’ hands. 

host      Generate instructions and select instruction tuning appropriate for the Alpha 
          processor that the compiler is running on. 

ev4       Generate instructions and select instruction tuning appropriate for the 21064, 
          21064A, 21066, and 21068 chips. 

ev5       Generate instructions appropriate for the 21064, 21064A, 21066, 21068, and 
          some 21164 chips (those not supporting byte and word load/store instructions). 
          Select instruction tuning appropriate for the 21164 chip. 

ev56      Generate instructions appropriate for some 21164 chips (those supporting byte 
          and word load/store instructions). Select instruction tuning appropriate for the 
          21164 chip. 

pca56     Generate instructions appropriate for the 21164PC chip, which supports byte and 
          word instructions and also some multimedia instructions. (This keyword is not 
          used for the tune option.) 

ev6       Generate instructions and select instruction tuning appropriate for the 21264 
          chip. 

Use of Newly-Added Instructions 


An architecture can be, and the Alpha architecture has been, extended through addition 
of a limited number of new machine instructions, which will be noted in Chapter 13. 
Through firmware enhancements, the kernels of operating systems can intercept the 
exception when one of those new opcodes or function codes is encountered on a system 
having an older hardware implementation. Exception handling with software emulation 
can consume dozens of instruction cycles per occurrence, however. 

Since the proportion of systems based on the earlier implementations will 
decrease over time, however, options can be offered to permit or instruct a compiler to 
generate the newer categories of instructions. The Alpha compilers provide that flexi- 
bility to permit or inhibit instructions which could consume huge numbers of instruc- 
tion cycles on some implementations, through compilation commands of the form: 


> cc -arch VAR com_c.c (Unix) 
$ cc/architecture=VAR com_c (OpenVMS) 

where the option keyword VAR may be one of the keywords listed in Table 12.4, and 
where the appropriate system command for any of the other languages with GEM com- 
pilers may be substituted for cc. 

Actually, the very simple COM_x programs do not contain any high-level con- 
structs which would benefit from any of the newer extensions of the Alpha architecture. 


In-Line Optimizations 


In Chapters 7 and 11 you had opportunities to observe the amount of overhead involved 
with function and procedure calls. The calling standards for the Alpha do diminish that 
impact relative to the situation for the VAX. Nevertheless, it has been common for 
many years for the system software for high-performance architectures to provide 
options for reducing call overhead. 

We have previously alluded to moving functions in-line, i.e., copying the body of 
the function or procedure right into the instruction stream of the caller rather than set- 
ting up a call. The same function can be replicated several times. While doing that does 
increase the overall size of the executable image, the virtual paging of the operating sys- 
tem can readily handle that aspect. Importantly, the total number of executed instruc- 
tions decreases by the amount of calling and returning which is avoided. 

The Alpha compilers provide the ability to consider all the components of a pro- 
gram holistically at compile time and to move routines in-line when that is advanta- 
geous. That feature is invoked using commands of the form: 


> cc -O3 -inline KEYWORD program.c (Unix) 
$ cc/optimize=inline=KEYWORD program (OpenVMS) 


where KEYWORD controls how extensively routines can be moved in-line. For 
Unix, the default for KEYWORD is none when other optimizations are inhibited, but is 
size (i.e., move routines in-line unless program size would increase markedly) when 
other optimizations are permitted. Other KEYWORD values are manual (controlled by a 
#pragma inline preprocessor directive), speed (i.e., move in-line those routines 
having greatest impact on speed, plus those in the manual category), and all. For 
OpenVMS, the default KEYWORD is automatic (i.e., move in-line those routines 
having greatest impact on speed, plus those in the manual category). Other KEYWORD 
values are none, manual (controlled by a #pragma inline preprocessor direc- 
tive), and all. 

The very simple C program INLINE (Figure 12.2) has been prepared to compare 
the effect of inhibiting or specifying that an internal function square be put in-line. If 
the compiler front end can consider both the main program and the function together, 
the compacted universal intermediate code will contain the instructions from both. The 
language-independent optimizer can then analyze opportunities to eliminate call 
overhead by moving portions of code in-line. 


/* This C program shows in-line optimization. */ 

#include <stdio.h> 
main () 
{ 
#ifdef __VMS 
#include <ints.h> 
    __int64 r2, x, y, z; 
    __int64 square( __int64 ); 
#define L "L" 
#else 
    long int r2, x, y, z; 
    long int square( long int ); 
#define L "l" 
#endif 
    printf ("Enter 3 integers: "); 
    scanf ("%" L "d" "%" L "d" "%" L "d", &x, &y, &z); 
    r2 = square (x)+square (y)+square(z); 
    printf( "%" L "d\n", r2 ); 
} 
#ifdef __VMS 
__int64 square( __int64 n ) 
#else 
long int square( long int n ) 
#endif 
{ 
    return n*n; 
} 

Figure 12.2 INLINE: Program illustrating an in-line function 





Several columns taken from the listing files for INLINE are arranged in Table 12.5 
with some of the corresponding instructions aligned by rows or placed nearby as in the 
previous comparisons in this chapter. Both are for optimization level -O3. 

Table 12.5 Effect of Moving a Function In-Line 


main: 


ldah 
lda 
lda 
stq 
ldgq 
ldgq 
stq 
mov 
jsr 
ldah 
lda 
lda 
Lag 
lda 
lda 
lda 


jsr 


ldah 
ldq 
lda 
bsr 
ldq 
mov 
bsr 
ldg 
addgq 
bsr 
ldq 
ldq 


addq 


-O3 -inline none 


gp, main 
gp, 
sp, 


rg, 


main 
-48(sp) 
8 (sp) 
24 (gp) 
r27; printe 


r9, 


r26; (8D) 
£9, E16 
£26, DPLACE 
gp, main 
gp, main 
r16, 24(r9) 
r27, scant 
Tis ¢ %8 

tib; y 

E19, Z 

r26, scanf 
gp, main 
ri; =x 

gp, main 
r26, square 
CLS, Y 

vO; xl 

r26, square 
F166, Z 

41, 20, X1 
r26, square 
r16, (gp) 
CAT: printi 
ri, 270, riy 


gp, (r27) 
gp, (gp) 
sp, -48(sp) 
r9, 8(sp) 
r9, 24(gp) 
r27, 16(¢p) 
r26, (sp) 
t9, x16 
$26, TAI 
gp, (r26) 
gp, (gp) 
rió; 24(r9) 
¥27, 8 (00) 
r17, 32 (80) 
rig, 24(sp) 
r19, 16(sp) 
#26, X27 
gp, (r26) 
ri6, 32 (sp) 
gp, (gp) 
r26, square 
t16, 24(sp) 
£0, BL 

r26, square 
r16, 16(6p) 
fl, CO, SS 
r26, square 
rió; (gp) 
r27, 16(gp) 
Ply EUs LY 


main: 


ldah 
lda 
lda 
stq 
ldg 
ldg 
stq 
mov 
Ter 
ldah 
lda 
lda 
ldq 
lda 
lda 
lda 
jsr 
ldg 
ldg 
ldah 
ldg 
lda 


mulq 


ldq 
ldq 
mulq 
mulq 
addq 
addq 


. 
. 


-O3 -inline all 
gp, main gp, 
gp, main gp, 
sp, -48 (sp) Sp, 
r9, 8(sp) 29, 
r9, 24 (gp) rg, 
tay printt ret; 
r26, (sp) T26; 
PS. y6 T3, 
r26, prints r26, 
gp, main gp, 
gp, main gp, 
r16, 24(r9) rib, 
r27, scanf E27, 
iy; = ELF 
r18, y Tis; 
Eig, 2 FL’, 
r26, scanf E26; 
nH, ¥ r0. 
mn, & t9, 
gp, main gp, 
f, Z ELi 
gp, main gp, 
D: My xO CO, 
rls, (gp) riLG., 
S27 Denci C27 
Nn, ü, ro rg, 
Nn, 0, wt cL, 
ro; BOs TO r9; 
50> tiz T17 ro, 


(r27) 
(gp) 
-48 (sp) 
8 (sp) 
24 (gp) 
16 (gp) 
(sp) 
t16 
t27 
(r26) 
(gp) 
24 (r9) 
8 (gp) 
32 (sp) 
24 (sp) 
16 (sp) 
E27 
24 (sp) 
32 (Sp) 
(r26) 
16 (sp) 
(gp) 


YO, £0 


(gp) 

16 (gp) 
#79, E9 
rij el 
rO; x0 


Er, Fiz 
Jer r26; printf z £26, £27 jsr T26, Printi = PEG, £27 
ldah gp, main ; gop, (26) ldah gp, main ; gop, (r26) 
ldq r26, (sp) ; £26, (sp) läg r26; (sp) + T26, (8p) 
lda gp, main ; gp, (gp) lda gp, main ; gp, (gp) 
ldq r9, 8(sp) ; r9; 8 (sp) ldq r9, 8(sp) + 2S, 8 (sp) 
cir x0 ; x0 Ele. PFO f ro 
lda sp, 48 (sp) ; sp, 48 (55) lda sp, 48 (sp) ; Sp, 48 (sp) 
ret r26 s 26 ret r26 n ¥26 
square: : square: : 
muig n; Nn; xo 2+ Tlọ; z160; xO muig n, n, #0 y wile; Le, £O 
ret r26 $ X26 ret r26 ; £26 


The already-optimized program in the left half of Table 12.5 shows instruction 
scheduling to reduce data stalls, for example, separation of the ldah and lda 
instructions from the expansion of the ldgp pseudo-instruction after the returns from scanf 
and printf. The function square is implemented using a “lightweight” procedure 
call that requires no register saving. 

In the right half of Table 12.5, the entire body of the square function has been 
moved in-line three times. Note that the optimizer has used appropriate register 
substitutions each time for n in the mulq instruction. Also note the attention to instruction 
scheduling to reduce data stalls. 
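
At the source level, the effect of the expansion is roughly as if the programmer had written the multiplications directly. The sketch below is a hypothetical restatement in C, not output from either compiler; the names sum_squares_called, sum_squares_inlined, and check_equivalence are ours.

```c
#include <assert.h>

/* The routine as written: a separate function reached by call and return. */
long square(long n)
{
    return n * n;
}

/* What the programmer wrote: three calls to square. */
long sum_squares_called(long x, long y, long z)
{
    return square(x) + square(y) + square(z);
}

/* What in-line expansion amounts to: the body substituted at each call
   site, with a different variable standing in for n each time. */
long sum_squares_inlined(long x, long y, long z)
{
    return x * x + y * y + z * z;
}

/* Both forms must agree for any inputs whose squares do not overflow. */
void check_equivalence(long x, long y, long z)
{
    assert(sum_squares_called(x, y, z) == sum_squares_inlined(x, y, z));
}
```

The two forms compute the same value; what differs is the number of calls and returns executed along the way.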

This oversimplified example does not result in enlarging the memory storage 
requirements of the program (it actually shrinks by one instruction), because the body 
of the routine moved in-line consists of only one instruction. In the general case, of 
course, moving routines in-line would make a program physically larger. 

This example does, however, clearly illustrate the significant reduction in pipeline 
bubbles owing to branch instructions. Three call/return pairs have been eliminated 
entirely, for a total of six branch-taken disturbances to the smooth, sequential execution 
of instructions. 


Post-Compilation Optimization 


Research has shown that binary optimizers can bring about further improvements in 
performance for programs that run on RISC systems. 

Two types of post-compilation optimizers have been developed. Static analyzers 
modify a fully compiled and linked program. Dynamic optimizers collect information 
while a program is running under its typical load, and then use that feedback 
information to rearrange the code and data sections optimally. An example of a static post-link 
optimizer is the om option for Digital Unix. Examples of dynamic optimizers are the 
cord utility for Digital Unix and Spike for Windows NT on the Alpha. 


Debugging Optimized Programs 


An optimized program no longer has an exactly matching relationship between the pro- 
grammer’s sequence of instructions in assembly language or a high-level language and 
the generated sequence of machine instructions. Accordingly, the recommended prac- 
tice is to inhibit all optimizations when using a debugger, in order to avoid confusion. 

In rare instances, an unoptimized program may appear to work correctly, yet the 
optimized version seems to operate peculiarly. Some debugger capabilities can still be 
of use, such as stopping on entry to a routine or watching for changes to a variable. 
Optimized instruction sequences may keep variables from a high-level language in 
registers and update them in memory only infrequently. If you examine such a variable by 
name instead of examining the appropriate register, you may see stale information. Moreover, keep 
in mind that a debugger cannot access a variable at all if it has been eliminated through 
optimization. 

When you have narrowed down the possible location of difficulty, you can step 
through by machine instruction, examine register contents, and inspect the sequence of 
actual machine instructions. Although voluminous, a listing file containing the machine 
instructions can also be helpful. 


Summary 


This chapter has extended the discussion of optimization begun in Chapter 11 at the 
assembly language level to the very sophisticated capabilities of compilers for high- 
level languages. Our examination of compiler output has looked at the effect of numer- 
ous levels and types of optimization on one very short program. We have explained how 
to find out what actual sequence of machine instructions a compiler has generated. 

Further examination of the internal operation of compilers, or the theoretical 
approaches behind their design and construction, would lie well beyond the scope of 
this book on Alpha architecture. 

EXERCISES 


12.1 When a needed information unit cannot be found in a cache, i.e., when a cache miss 
occurs, the design of cache systems typically calls for some fixed number of adjacent 
information units to be loaded into the cache, that number of bytes (probably a power 
of 2) being called the cache block or line size. Explain why this method may lower 
the miss rate for an instruction cache to a greater extent than for a data cache. Decide 
whether a compiler should generate a modestly longer alternate instruction sequence 
if doing so could reduce data cache misses in a particular section of a program. 


12.2 Suggest why the compiled COM_P program (Table 12.1) claims 16 more bytes of 
stack space than the COM_C program. 


12.3 Explain how the machine instructions that implement loop control in the three vari- 
ants of the COM_x program (Table 12.1) do, in fact, match what each high-level lan- 
guage loop expression requires (Figure 12.1) for logical program flow. 


12.4 Find out whether a compiler uses a shift instruction instead of multiplication by 8 in 
COM_x (Table 12.1) if the variable I is defined to be a 64-bit integer (quadword) 
instead of a 32-bit integer (longword). 


12.5 Find out whether the intermediate levels of optimization for a high-level language 
compiler of your choice appear to produce gradations of efficiency interpolated 
between what was shown in Tables 12.1 and 12.2, or whether unexpected results 
occur. Report your findings in a manner which might hypothetically have been 
included in this chapter. 


12.6 Define or explain what is meant by: loop unrolling; in-line expansion; instruction 
scheduling. 


12.7 Prepare an analytical report that compares various optimizations of one of these lan- 
guage variants of the MATRIX_x program containing a two-dimensional matrix that 
is involved in matrix multiplication. (These program files are included on the CD- 
ROM that accompanies this book.) 


PROGRAM MATRIX_F 
DOUBLE PRECISION A(11,17), B(17), C(11) 
INTEGER*4 I, J 
DO I=1,11 
   C(I)=0 
   DO J=1,17 
      C(I)=C(I)+A(I,J)*B(J) 
   END DO 
END DO 
PRINT *, C(1) 
END 


PROGRAM MATRIX_P; 
VAR 
   a : ARRAY[1..11,1..17] of double; 
   b : ARRAY[1..17] of double; 
   c : ARRAY[1..11] of double; 
   i, j : integer; 
BEGIN 
   FOR i := 1 TO 11 DO 
      BEGIN 
         c[i] := 0; 
         FOR j := 1 TO 17 DO 
            c[i] := c[i] + a[i,j] * b[j]; 
      END 
END. 


main () 
{ 
   double a[11][17], b[17], c[11]; 
   int i, j; 
   for (i=0; i<11; i++) 
   { 
      c[i] = 0; 
      for (j=0; j<17; j++) 
         c[i] = c[i] + a[i][j] * b[j]; 
   } 
} 


Comment whether any difficulties or ambiguities arise because the elements of A and 
B are uninitialized. 


12.8 (Internet or library project) The Gnu compilers for Unix systems have some machine 
independence when it comes to emitting code. For instance, one could implement a 
cross-compiler that compiles on a Sun for an Intel-based PC. Find out and report 
whether a Gnu compiler uses an intermediate code representation that is analogous to 
the GEM compilers. Also report on any optimizations and the amount of control one 
has over them. 


CHAPTER 13 


Extensions to the Basic 
Alpha Architecture 


No vendor can expect to thrive in a rapidly paced 
technological marketplace without bringing forth both innovations which differentiate 
its products from those of others and continual developments that offer new features or 
better performance. For computer systems, these are the realms of architectural innova- 
tion and of implementational evolution in the distinction with which this book began. In 
practice, the dividing line between changes at the levels of architecture and implemen- 
tation can blur. Moreover, certain functions and features can be implemented in pure 
hardware, in firmware, or in pure software. 


Several successful computer architectures have persisted for two decades or even 
longer. During such long intervals, many temptations and perhaps some genuine oppor- 
tunities for modifying the architecture will occur. New instructions can be added where 
earlier phases of an architecture had unused, “reserved” opcodes and function codes. 
Any proposed change ought to be weighed against the impact upon existing hardware 
systems and the software applications already widely used on them. 

Can and should any new instructions be somehow supported retrospectively on 
older systems, through some sort of emulation? Which is the best use of available area 
at the chip level, to support additional instructions, to achieve faster instruction execu- 
tion, or to provide larger on-chip instruction and data caches? Different companies at 
different times have chosen a variety of responses to such concerns. 

Computer manufacturers sometimes extend a computer architecture in ways that 
call into question the rather strict definition of architecture being used in this book. For 
example, several of the PDP-11 implementations introduced midway along in the life 
cycle of that architecture incorporated a “commercial instruction set” that included 
several groups of instructions to manipulate packed decimal numbers (used by COBOL) 
and strings. Older PDP-11 implementations could not execute programs compiled to 
take advantage of those new instructions, however. 

In such situations, a software developer would generally have to choose among 
several strategies: 


• Ignore the new instructions, thereby depriving any customer with new hardware of 
its maximal benefits. 

• Force customers to upgrade their hardware, either overtly or by defining a new or 
enhanced software product having higher minimum hardware requirements; the 
software supplier might lose revenue from those customers not upgrading to the 
new application, and would thereafter incur higher costs maintaining two software 
product lines. 

• Produce and maintain dual versions of the traditional software product through 
“conditional compilation,” thereby increasing overhead, distribution costs, 
support costs, and perhaps the price of the software product. 

• Attempt to engineer a single executable file that contains, at one or multiple 
points, the necessary internal branching to use or not use the new instructions; this 
bigger software would need more memory on every customer’s system, with some 
performance degradation from the branching; the software supplier’s support 
costs might increase. 


The last two options are often preferred by customers, provided that the supplier does 
not have to raise its prices appreciably. 
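
The last strategy, a single executable with internal branching, can be sketched in C as follows. Everything named here is hypothetical: new_instructions_available stands in for whatever query a real product would make of the hardware or operating system, and the two summing routines merely stand in for code compiled without and with the newer instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe: a real product would ask the hardware or the
   operating system whether the newer instructions are implemented. */
int new_instructions_available(void)
{
    return 0;   /* assume an older implementation for this sketch */
}

/* Version restricted to the original instruction set. */
unsigned long sum_bytes_basic(const unsigned char *p, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Stand-in for a version compiled to use the newer instructions; here it
   simply walks the buffer backward so that the two bodies differ. */
unsigned long sum_bytes_extended(const unsigned char *p, size_t n)
{
    unsigned long total = 0;
    while (n > 0)
        total += p[--n];
    return total;
}

typedef unsigned long (*sum_bytes_fn)(const unsigned char *, size_t);

/* The internal branch: decide once, then call through the pointer. */
sum_bytes_fn select_sum_bytes(void)
{
    sum_bytes_fn chosen = new_instructions_available() ? sum_bytes_extended
                                                       : sum_bytes_basic;
    assert(chosen != NULL);
    return chosen;
}
```

Both routines must be present in the executable whichever one is selected, which is the memory cost noted in the list above; the indirect call is one form of the branching overhead.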

The Alpha architecture specification explicitly requires that software emulation in 
the kernel of an operating system be able to complete any valid instruction which can- 
not be executed wholly by the hardware. As and when new instructions are added to the 
architecture, therefore, the system-supplied software must be reissued to enable sys- 
tems based on older hardware implementations to remain in architectural conformance. 
That approach has the advantage of requiring minimal action on the part of either cus- 
tomers (just the system software upgrade) or third-party software producers (just the 
decision of whether to use new instructions at all). 

Somewhat contrasting with the PDP-11, the VAX architecture has held much 
more closely to the stricter definition of computer architecture used in this book. For 
example, VAX implementations have varied in how the instructions for certain of the 
several VAX floating-point representations were handled, i.e., within the CPU, with 
add-on hardware, or through software emulation. Support software always ensured that 
any implementation could execute any of the instructions, though perhaps with large
relative time penalties. Analogously, the emulation of PDP-11 instructions moved from 
the CPU to supporting software emulation for later VAX implementations, including all 
of those in which the entire CPU was realized on one chip. 

In this chapter, we will trace some of the changes that have occurred with Alpha 
systems after their initial introduction in 1992. 


Basic Alpha Architecture 


Previous chapters of this book have discussed nearly all of the instructions that were 
directly supported in the earliest CPU implementations of the Alpha architecture. Those 
are now considered to comprise the basic Alpha architecture. 

We list in Table 13.1 some miscellaneous instructions in the basic set with a few 
later additions, but without very much discussion of them, since they are mainly used 
by system software rather than by user application software. The Alpha Architecture 
Reference Manual describes them. 


Table 13.1 Miscellaneous Alpha Instructions

Instruction(s)             Function

call_pal                   Call a routine in the privileged architecture library (see later
                           section in this chapter).

ecb*                       Used to provide a hint that a data address will not be referenced
                           again soon.

excb*, mf_fpcr, mt_fpcr    Used to ensure validity of, to read from, and to write to the
                           floating-point control register (FPCR).

fetch, fetch_m             Prefetch data with intent to read or to modify (these are said not
                           to be useful on any implementation through 21264, however).

mb                         Ensure serialization of memory accesses in multiprocessor systems.

rc, rs                     For use only by a VAX-to-Alpha software translator; will not be
                           permanently part of the Alpha architecture specification.

rpcc                       Read processor cycle counter (potentially useful for measuring
                           timing effects on the order of nanoseconds).

trapb                      Used to isolate the range of addresses in which an exception
                           occurs and may be reported.

wh64*                      Used to provide a hint that the 64-byte block containing the
                           specified address will be overwritten soon but not read again.

wmb*                       Causes pending writes to memory or I/O address space to be
                           completed before any subsequent stores take effect.

* Not part of the first implementation (21064 chip).

Although the first three major implementations of the Alpha architecture have all
included VAX-compatible floating-point instructions, the architecture reference manual 
explains that those instructions could be “subsettable” in the future. An Alpha imple- 
mentation may include the VAX-compatible or the IEEE floating-point instructions, or 
both, or neither, at the hardware level. For example, some future implementation might 
emulate the VAX-compatible floating-point instructions in software rather than com- 
plete them using a functional unit within the CPU. The VAX-compatible floating-point 
instructions are used mostly in the OpenVMS programming environment. Our book 
omits direct discussion of the VAX-compatible data formats or instructions. 

The Alpha architecture also defines memory and register formats for the 
extended-precision IEEE X_floating representation, but without direct hardware sup- 
port in any of the first three major implementations. Hypothetically, new hardware 
instructions could be introduced for manipulating X_floating quantities in the future. 


Alpha Implementation Differences 


While this book does not treat matters of hardware organization in any detail, our readers 
should pick up some awareness of the sorts of implementation differences that any archi- 
tecture—the Alpha included—may come to have over time. Major issues include tech- 
nological feasibility, overall system performance, cost, and heat dissipation. The best 
technology does not always win, however. Favorable perceptions by opinion-makers, 
market acceptance, and the activities of competitors are also important factors influenc- 
ing ultimate success. 

As chip fabrication techniques are progressively refined to a smaller and smaller 
feature size, a given chip area can then be developed into more transistors. New imple- 
mentations can use additional transistors in several ways: 


• increase the internal superscalar parallelism, i.e., more and/or faster execution
units;

• provide internal registers, invisible to the assembly language programmer, that can
help alleviate data stalls in deeply pipelined execution units;

• increase the amount of on-chip cache for instructions and/or data; and

• support additional instruction types.


All of those enhancements have appeared in one or another of the successive implemen- 
tations of the Alpha architecture, as summarized in Table 13.2. 

Table 13.2 Implementations of the Alpha Architecture


Chip Name 21064* 21064A* 21164 21164† 21164PC 21264

Date of introduction 1992 1993 1995 1996 1997 1998 
Process name EV4 EV45 EV5 EV56 PCA56 EV6
Metal layers 3 4 4 4 6
Feature size, µm 0.75 0.5 0.5 0.35 0.35 0.35
Operating voltage 3 3.3 2.0 Me, 2.0 
Maximum MHz 200 275 300 600 600 600+ 
Transistors, millions 1.68 2.8 9.3 3.4 15.2 
On-chip cache, KB 

I-cache 8 16 8 8 16 64 

D-cache 8 16 8 8 8 64 

S-cache none none 96 96 none none 
Invisible registers none none 6 6 80 
Address bits 

virtual 43 43 43 

physical 34 40 a3 
Memory path bits 128 128 128 
Superscalar degree 

integer 1 2 2 4

floating-point 1 2 2 = 
Instruction issue 

rate 2 2 4 4 4 4 

out of order no no no no no yes 


Pipeline stages 


integer 7 7 7 
floating-point 10 9 9 
Instruction subsets 
basic X X X X X X 
byte and word extension X X X 
motion video extension X X 
bit count extension X 
floating-point extension X 


* The related 21066 (1993) and 21066A (1994) chips offered somewhat lower performance because of a 64- 
bit memory path. 


† This re-implementation of the 21164 chip is sometimes called the 21164A.

It is unimportant, for the purposes of this book, that some of the cells in Table 13.2
are not filled in. Enough information is given there to enable you to see how the uses of 
available transistors can vary from one implementation to another. The voltage swings 
between the representations of logical 0 and logical 1 in the CPU generally have to be 
decreased as the feature size on the die for the chip decreases, in order to sustain 
increased clocking speeds and minimize heat. 

In implementations of the Alpha architecture, in contrast to some others from dif- 
ferent makers, the multiple integer or floating-point execution units are somewhat spe- 
cialized and cannot execute every type of integer or floating-point instruction, 
respectively. During decoding, instructions are targeted for an appropriate superscalar 
unit for execution. 

The 21064A used additional transistors primarily for additional on-chip cache. 
The 21164 went back to 8KB primary cache units for instructions and data, but added a 
large secondary cache unit for instructions and data on the chip. Additional enhance- 
ments included six invisible registers used internally by operating systems (through 
PALcode), superscalar execution units, and a doubling of instruction issue from two to 
four. The revised 21164 (aka 21164A) added byte and word instructions as an architec- 
tural extension. The 21164PC introduced motion video instructions as an architectural 
extension. The 21264 incorporated larger on-board primary caches, numerous invisible 
registers used in the pipelines, expanded superscalar execution elements, out-of-order 
instruction issue, and further architectural extensions with instructions for bit-oriented 
integer counts, square root, and moving data between Rxx and Fxx registers. 

The superscalar elements, invisible registers, and pipeline refinements enable the 
newer chips to have many instructions in process at once (as many as 37 for the 21164, 
or 80 for the 21264). 


Byte/Word Extension 


The original Alpha architecture stimulated some critical commentary because of its 
lack of any direct capability to access information units as small as bytes and words in 
memory and the necessity of using instruction sequences (Chapter 6) to do so. The 
designers had made certain assumptions regarding the needs of operating systems (ini- 
tially OpenVMS, followed shortly by Unix) and the likely bus structures for input and 
output (e.g., Futurebus+) that systems would use. On the basis of those assumptions, the 
omission of direct access to bytes or words seemed acceptable. 

Later, very careful simulations showed that the support of Windows NT and the
use of the PCI bus could benefit by some five percent in overall performance if byte and
word accesses were added to the Alpha architecture. Actual measurements,
after such instructions were added at the time of the re-implementation of the 21164
chip, bore out those predicted benefits. Software and firmware changes permit older 
Alpha systems to recognize the new opcodes as exceptions and complete the requested 
load or store operations through emulation, but with a time penalty relative to the 
sequences in Chapter 6 (which do not trigger exceptions on any Alpha implementation). 

The group of new instructions comprising the byte/word extension to the Alpha 
architecture is summarized in Table 13.3. 


Table 13.3 Alpha Byte/Word Load, Store, and Sign Extend Instructions

                      Function
Mnemonic    Opcode    Code        Purpose

ldbu        0A        n/a         Load Ra with unsigned byte contents found at effective
                                  address.

ldwu        0C        n/a         Load Ra with unsigned word contents found at effective
                                  address.

stb         0E        n/a         Store byte contents in Ra at effective address.

stw         0D        n/a         Store word contents in Ra at effective address.

sextb       1C        00          Put sign-extended value from Rb or literal into Rc.

sextw       1C        01          Put sign-extended value from Rb into Rc.

The new load/store instructions use the same memory format (Figure 4.1) and fol- 
low the same assembler syntax as the original load and store instructions, namely: 


     ldtu    Ra,disp(Rb)
     stt     Ra,source


where the information unit type t = b for byte data or t = w for word data, and where 
disp is a signed displacement to be added to the current address value in register Rb or 
where source is a symbolic address that is within 32K addressing units (in either 
direction) from a base register value known to the assembler. 

The ldbu and ldwu instructions perform zero-extension into the full 64-bit
width of register Ra. That is, the information unit in memory is treated as the source of
an unsigned quantity. For the ldwu instruction, the item must be naturally
word-aligned (i.e., bit 0 of the effective address must be zero), or else a very time-consuming
exception will occur.
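
The zero-extension and alignment rules just described can be sketched as a small behavioral model. This is only an illustration in Python, not Alpha code: the names `ldbu`, `ldwu`, and `AlignmentFault` are our own stand-ins, memory is modeled as a `bytes` object, and little-endian byte order is assumed. On a real system the unaligned case is handled by a slow exception path rather than by a Python exception.

```python
class AlignmentFault(Exception):
    """Stand-in for the exception taken on an unaligned word access."""


def ldbu(memory: bytes, address: int) -> int:
    """Model ldbu: fetch one byte; the result is already zero-extended."""
    return memory[address]


def ldwu(memory: bytes, address: int) -> int:
    """Model ldwu: fetch a 16-bit word (little-endian assumed), zero-extended."""
    if address & 1:  # bit 0 of the effective address must be zero
        raise AlignmentFault(f"word access at odd address {address:#x}")
    return int.from_bytes(memory[address:address + 2], "little")


mem = bytes([0x80, 0xFF, 0x34, 0x12])
print(hex(ldbu(mem, 0)))  # the byte 0x80 is not treated as negative
print(hex(ldwu(mem, 2)))
```

Because Python integers are unbounded, the upper 56 or 48 bits of the modeled register are simply zero, which matches the unsigned interpretation described above.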

In these Alpha memory-reference instructions, the displacement is allocated only 
16 bits. Since this displacement is interpreted as a signed byte address offset, only a 
range of data addresses from -32768 to +32767 with respect to register Rb is accessible. 
If some data needed at the same time are more widely separated than 64K addressing
units, more than one base register Rb can be used.
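
The displacement arithmetic can be illustrated with a short model. The following Python sketch is ours, with hypothetical helper names; it shows only how a 16-bit displacement field is sign-extended and added to the base register value, assuming 64-bit wraparound of the sum.

```python
MASK64 = (1 << 64) - 1


def sign_extend_16(field: int) -> int:
    """Interpret a 16-bit displacement field as a signed quantity."""
    field &= 0xFFFF
    return field - 0x10000 if field & 0x8000 else field


def effective_address(rb: int, disp_field: int) -> int:
    """Add the signed displacement to the base register value."""
    return (rb + sign_extend_16(disp_field)) & MASK64


print(sign_extend_16(0x7FFF), sign_extend_16(0x8000))
print(hex(effective_address(0x10000, 0xFFFF)))  # displacement of -1
```

The two extreme field values reproduce the limits of the addressable range around Rb that were stated above.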

The sign extend instructions use the integer operate format (Figure 4.1), but since 
they require only two real operands, the operand in the Ra position must be forced to R31: 


     sextt   Rb,Rc     ; Rc <-- sign-extended contents of Rb
     sextb   lit,Rc    ; Rc <-- sign-extended literal


where the information unit type t = b for byte data or t = w for word data. (An exer- 
cise at the end of this chapter considers the analogous situation of sign-extending 
longword data.) 

Note that the sextb and sextw instructions, together with the zero-extending 
behavior of the byte and word load instructions, permit many cases shown in Chapter 6 
to be accomplished with just one or two instructions. If an implementation does support 
the new instructions, such short sequences will execute faster than the longer sequences 
although the latter can execute on any Alpha without producing exceptions. 
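
To make the combination concrete, here is a hedged Python model of the two sign extend instructions and of a signed byte load built from a zero-extending load followed by a sign extension. The function names are ours, and the registers are modeled as 64-bit unsigned integers.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of value to a 64-bit register image."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK64


def sextb(rb: int) -> int:
    return sign_extend(rb, 8)


def sextw(rb: int) -> int:
    return sign_extend(rb, 16)


def load_signed_byte(memory: bytes, address: int) -> int:
    """Two-step equivalent: zero-extending byte load, then sextb."""
    zero_extended = memory[address]  # what ldbu would leave in the register
    return sextb(zero_extended)


print(hex(sextb(0x80)))
print(hex(sextw(0x7FFF)))
```

Only the low 8 or 16 bits of the source take part, so any garbage in the upper bits of Rb is discarded by the sign extension.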


Motion Video Extension 


Intel Corporation introduced multimedia extensions (some 57 MMX™ instructions) 
into later implementations of its Pentium processors. Similarly, Sun Microsystems 
introduced VIS instructions into later implementations of the UltraSparc™ processors, 
MIPS introduced MDMX™ extensions into its processors using the MIPS V™ archi- 
tecture, and Motorola introduced AltiVec™ instructions into later implementations of 
the PowerPC processors. All of those extensions of previously successful architectures 
were motivated by perceived needs to be able to process data faster as vectors and 
encode and/or decode compressed video formats using the main processor instead of 
requiring add-in cards in systems supporting graphic-intensive applications. Some sets 
of such added instructions have been quite extensive. 

The corresponding architectural extension for the Alpha, first brought out in the 
21164PC chip, includes a comparatively minimal set of new instructions (Table 13.4), 
perhaps in keeping with the minimalistic nature of the original Alpha instruction set. 



Table 13.4 Alpha Motion Video Instructions

            Function
Mnemonic    Code        Purpose

Vector unsigned/signed byte/word minimum/maximum

minub8      3A          Copy into Rc each corresponding byte from Ra or Rb,
                        whichever is smaller as an unsigned quantity.

minsb8      38          Copy into Rc each corresponding byte from Ra or Rb,
                        whichever is smaller as a signed quantity.

minuw4      3B          Copy into Rc each corresponding word from Ra or Rb,
                        whichever is smaller as an unsigned quantity.

minsw4      39          Copy into Rc each corresponding word from Ra or Rb,
                        whichever is smaller as a signed quantity.

maxub8      3C          Copy into Rc each corresponding byte from Ra or Rb,
                        whichever is larger as an unsigned quantity.

maxsb8      3E          Copy into Rc each corresponding byte from Ra or Rb,
                        whichever is larger as a signed quantity.

maxuw4      3D          Copy into Rc each corresponding word from Ra or Rb,
                        whichever is larger as an unsigned quantity.

maxsw4      3F          Copy into Rc each corresponding word from Ra or Rb,
                        whichever is larger as a signed quantity.

Pixel error

perr        31          Compute in Rc the sum of the 8 bytewise absolute differences
                        between values at corresponding byte positions in Ra
                        and Rb; this can produce a result from 0 to 255*8.

Pack and unpack bytes or words

pklb        37          Pack the lowest byte of each longword within Rb into the
                        lowest 2 bytes of Rc; clear the upper 6 bytes of Rc.

pkwb        36          Pack the lowest byte of each word within Rb into the lowest
                        4 bytes of Rc; clear the upper 4 bytes of Rc.

unpkbl      35          Unpack the lowest 2 bytes of Rb as zero-extended longwords
                        in Rc.

unpkbw      34          Unpack the lowest 4 bytes of Rb as zero-extended words in
                        Rc.

The assembler syntax for these instructions, all of which use opcode 1C, can take
the following forms: 


     minst    Ra,Rb,Rc
     minst    Ra,lit,Rc
     maxst    Ra,Rb,Rc
     maxst    Ra,lit,Rc

     perr     Ra,Rb,Rc
     pktb     Rb,Rc      ; Ra is not used
     unpkbt   Rb,Rc      ; Ra is not used


where s = u for unsigned or s = s for signed, t = b8 for bytes or t = w4 for words in the
minimum and maximum mnemonics, and t = l for longword or t = w for word in the pack and
unpack mnemonics. The two-operand instructions that pack and
unpack bytes do not use the register in the Ra position (it should be R31).
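
The descriptions in Table 13.4 can be read as simple operations on the eight byte lanes of a 64-bit register. The following Python sketch models four of the instructions under that reading; the helper names `byte_lanes` and `pack_lanes` are ours, lane 0 is taken to be the lowest-order byte, and nothing here reflects how the hardware actually computes the results.

```python
def byte_lanes(reg: int) -> list:
    """Split a 64-bit register image into 8 bytes, lowest-order byte first."""
    return [(reg >> (8 * i)) & 0xFF for i in range(8)]


def pack_lanes(lanes) -> int:
    """Reassemble a register image from bytes, lowest-order byte first."""
    return sum(b << (8 * i) for i, b in enumerate(lanes))


def minub8(ra: int, rb: int) -> int:
    """Bytewise unsigned minimum."""
    return pack_lanes(min(a, b) for a, b in zip(byte_lanes(ra), byte_lanes(rb)))


def perr(ra: int, rb: int) -> int:
    """Sum of the 8 bytewise absolute differences."""
    return sum(abs(a - b) for a, b in zip(byte_lanes(ra), byte_lanes(rb)))


def pklb(rb: int) -> int:
    """Lowest byte of each longword into the lowest 2 bytes; rest cleared."""
    return (rb & 0xFF) | (((rb >> 32) & 0xFF) << 8)


def unpkbl(rb: int) -> int:
    """Lowest 2 bytes become two zero-extended longwords."""
    return (rb & 0xFF) | (((rb >> 8) & 0xFF) << 32)


print(perr(0, (1 << 64) - 1))  # largest possible pixel error
print(hex(unpkbl(pklb(0x000000AA000000BB))))
```

Note that unpkbl undoes pklb only when the upper three bytes of each longword were already zero, since pklb discards them.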

We will not offer further commentary on these instructions in the motion video 
extension or illustrate their actual use. Consideration of video compression techniques 
is an applied area rather far afield from our central purposes. 


Count Extension 


The instructions of the motion video extension and the cmpbge instruction of the basic 
Alpha instruction set deal with bytes of data within the integer register set. The instruc- 
tions of the count extension take the granularity of data analysis down to the bit level, 
more like the logical functions. 

The group of new instructions comprising the count extension to the Alpha archi- 
tecture, as implemented in the 21264 chip, is summarized in Table 13.5. 


Table 13.5 Alpha Bit Count Instructions

            Function
Mnemonic    Code        Purpose

ctlz        32          Compute in Rc the number of contiguous leading zeros in
                        register Rb, starting from bit position 63.

cttz        33          Compute in Rc the number of contiguous trailing zeros in
                        register Rb, starting from bit position 0.

ctpop       30          Count the population of ones in register Rb and put the
                        result in Rc.





The assembler syntax for these instructions, all of which use opcode 1C, can take 
the following forms: 

     ctlz     Rb,Rc      ; Ra is not used
     cttz     Rb,Rc      ; Ra is not used
     ctpop    Rb,Rc      ; Ra is not used


These are essentially two-operand integer operate instructions, with register Ra not 
used. The assembler will set register Ra to R31, the mandatory value. 

These instructions have potential uses in rapid evaluation of bit-encoded data, 
considered 64 bits at a time. 
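
As an illustration of what these three instructions compute, here is a minimal Python model. The function names mirror the mnemonics but the code is ours. Returning 64 for an all-zero source in `ctlz` and `cttz` is an assumption of this sketch; the text above does not say what the instructions do in that case.

```python
MASK64 = (1 << 64) - 1


def ctpop(rb: int) -> int:
    """Number of one bits in the 64-bit value."""
    return bin(rb & MASK64).count("1")


def ctlz(rb: int) -> int:
    """Leading zeros, counted down from bit 63 (64 assumed for zero)."""
    return 64 - (rb & MASK64).bit_length()


def cttz(rb: int) -> int:
    """Trailing zeros, counted up from bit 0 (64 assumed for zero)."""
    rb &= MASK64
    if rb == 0:
        return 64
    return (rb & -rb).bit_length() - 1  # isolate the lowest set bit


flags = 0b1011000  # a hypothetical bit-encoded set of flags
print(ctpop(flags), cttz(flags), ctlz(flags))
```

A typical use is scanning a 64-bit bitmap: `cttz` gives the index of the first set flag, and `ctpop` gives how many flags are set in all.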


Square Root and Floating-Point Register Move Extension 


High-performance scientific computing has often predominated as the driving force for 
advanced computer architectures, not only ensuring the presence of basic floating-point 
arithmetic operations (add, subtract, multiply, divide) but sometimes prompting the 
inclusion of other mathematical instructions. The VAX architecture includes a polyno- 
mial instruction for calculating expressions of the form 


     $C_0 + C_1 x + C_2 x^2 + \cdots$

which are used in the evaluation of functions by series expansion. Many scientific cal- 
culations also involve the square root function, and a square root instruction is not 
uncommon among RISC architectures, e.g., PA-RISC®, PowerPC, and SPARC. 


The 21264 chip implementation of the Alpha architecture includes a square root and 
floating-point register move extension consisting of the new instructions in Table 13.6. 


Table 13.6 Alpha Square Root and Floating-Point Register Move Instructions

                      Function
Mnemonic    Opcode    Code        Purpose

Square root calculation*

sqrts       14        08B†        Compute S_floating square root in Fc from Fb.

sqrtt       14        0AB†        Compute T_floating square root in Fc from Fb.

Integer to Floating-Point Register Move*

itofs       14        004         Map bits <0:31> from Ra to Fc using the S_floating
                                  register format.

itoft       14        024         Transfer all 64 bits from Ra to Fc.

Floating-Point to Integer Register Move

ftois       1C        78          Map bits <63:62> and <58:29> from Fa to bits <31:0>
                                  in Rc using the S_floating memory format; also
                                  replicate bit 63 from Fa to bits <63:32> in Rc.

ftoit       1C        70          Transfer all 64 bits from Fa to Rc.

* Other instructions for VAX-compatible representations are not shown in this book.

† Basic function codes are shown; /c, /d, /m, /i, /s, /u qualifiers are recognized (though /u has no effect).

Square Root Instructions


The square root instructions use the floating-point operate format (Figure 4.1), but
since they require only two real operands, the operand in the Fa position must be forced
to F31:


sqrtt Fb,Fc ; Fc <-- square root of value in Fb 


where t = s for S_floating and t = t for T_floating. These instructions return +0 if Fb 
contains positive or negative zero, but signal an invalid operation if Fb contains any 
other negative value. 

The calculation of a square root, like floating-point division, is not fully pipelined 
in the 21264 chip implementation; these instruction types therefore require many more 
cycles than floating-point addition/subtraction or multiplication. 


Floating-Point Register Move Instructions 


Early implementations of the Alpha architecture lacked the requisite internal data 
pathways that could have enabled direct transfer of data between the integer register set 
and the floating-point register set. The 21264 chip implementation does provide that 
capability in both directions with a group of register move instructions: 


     itoft    Ra,Fc      ; Rb must be R31
     ftoit    Fa,Rc      ; Fb must be F31


where t = s for S_floating and t = t for T_floating. These instructions eliminate the 
accesses to memory required by the equivalent sequences using only the basic Alpha 
instruction set: 


     itoft   Rs,Fd     is equivalent to     stt’   Rs,temp
                                            ldt    Fd,temp

     ftoit   Fs,Rd     is equivalent to     stt    Fs,temp
                                            ldt’   Rd,temp


where t = s for S_floating and t = t for T_floating, where t’ = l if t = s or t’ = q if
t = t, and where s and d stand for source and destination, respectively.
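
The bit mapping given for ftois in Table 13.6 can be checked with a short model. The Python sketch below is ours: it treats a register as a 64-bit unsigned integer and assumes that a single-precision value held in a floating-point register has the same bit image as the corresponding IEEE double, which holds for ordinary (normalized) values. The helpers `double_bits` and `single_bits` exist only to build test data.

```python
import struct

MASK64 = (1 << 64) - 1


def ftoit(fa: int) -> int:
    """Transfer all 64 bits unchanged."""
    return fa & MASK64


def ftois(fa: int) -> int:
    """Bits <63:62> and <58:29> of Fa go to <31:0>; bit 63 fills <63:32>."""
    fa &= MASK64
    low32 = (((fa >> 62) & 0x3) << 30) | ((fa >> 29) & 0x3FFFFFFF)
    if fa >> 63:
        low32 |= 0xFFFFFFFF << 32
    return low32


def double_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def single_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


print(hex(ftois(double_bits(1.0))), hex(single_bits(1.0)))
```

The low longword produced by the model is the same bit pattern that storing the value in S_floating memory format and reloading it as a longword would give, which is the equivalence stated above.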

Note that these instructions open up opportunities to use underutilized registers in 
one set (probably floating-point) for saving temporary values when the other set (proba- 
bly integer) is already fully utilized. For this to happen, the rules and tables that control 
the register allocations made by compilers will require elaboration beyond what suf- 
ficed for previous Alpha implementations. 

Determining Extensions and Implementation Version


Once an architecture acquires extensions, the need arises for compilers and system 
boot-up code to be able to determine the characteristics of a particular machine config- 
uration. 


The Alpha architecture includes two instructions (Table 13.7) that can report 
about the nature of the computer on which they are executed. 


Table 13.7 IMPLVER and AMASK Instructions

            Function
Mnemonic    Code        Purpose

implver     6C          Put in Rc a code identifying the major implementation level:
                          0 ==> 21064 (EV4), 21064A (EV45), 21066A/21068A (LCA45)
                          1 ==> 21164 (EV5), 21164A (EV56), 21164PC (PCA56)
                          2 ==> 21264 (EV6)

amask       61          Report in Rc which processor features, specified by a bit mask
                        in Rb or a literal, are actually present; the following bit
                        positions are used:
                          0 ==> Byte/word extension (ldbu, ldwu, sextb, sextw,
                                stb, and stw instructions)
                          1 ==> Square root and floating-point register move extension
                                (ftois, ftoit, itofs, itoft, sqrts, and sqrtt
                                instructions)
                          2 ==> Count extension (ctlz, cttz, and ctpop instructions)
                          8 ==> Motion video extension (maxsb8, maxsw4, maxub8,
                                maxuw4, minsb8, minsw4, minub8, minuw4, perr, pklb,
                                pkwb, unpkbl, and unpkbw instructions)
                          9 ==> Precise arithmetic trap reporting supported in hardware





The assembler syntax for these instructions, which share opcode 11 with other 
instructions in the integer operate group, is: 


     implver  Rc         ; Ra must be R31
                         ; Rb must be the literal 1
     amask    Rb,Rc      ; Ra must be R31
     amask    lit,Rc     ; Ra must be R31

The implver instruction reports a code that identifies clusters of specific implementations
that share similar pipeline and execution unit designs, thus affording a way to
select appropriate tuning (instruction-scheduling) decisions.


For the amask instruction, source register Rb or the literal expresses a bit mask of 
the features to be inquired about. When this instruction executes, individual bits of the 
mask are cleared corresponding to features present and bits of the mask are otherwise 
transferred into destination register Rc corresponding to features absent. Thus the result 
is zero only if all the requested features are present. 


For later Alpha implementations (21164A, 21164PC, and 21264), the amask 
instruction operates bitwise as just discussed. For early Alpha implementations, a firm- 
ware emulation for the amask instruction always copies register Rb (or the literal) to 
Rc because none of the newer features are present. 
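
Since the sense of the amask result is easy to get backward, a small model may help. In this Python sketch the constant names and the per-chip feature sets are ours (the feature sets follow the prose earlier in this chapter); the only rule being illustrated is that bits for features that are present are cleared and all other mask bits pass through.

```python
# Bit positions from Table 13.7; the constant names are illustrative only.
BYTE_WORD = 1 << 0
SQRT_FPMOVE = 1 << 1
COUNT = 1 << 2
MOTION_VIDEO = 1 << 8

# Feature sets assumed for three representative implementations.
FEATURES_21064 = 0
FEATURES_21164A = BYTE_WORD
FEATURES_21264 = BYTE_WORD | SQRT_FPMOVE | COUNT | MOTION_VIDEO


def amask(mask: int, features_present: int) -> int:
    """Clear each mask bit whose feature is present; keep the rest."""
    return mask & ~features_present


def can_use(mask: int, features_present: int) -> bool:
    """All requested features are present exactly when the result is zero."""
    return amask(mask, features_present) == 0


wanted = BYTE_WORD | COUNT
print(amask(wanted, FEATURES_21164A) == COUNT)  # count extension is missing
print(can_use(wanted, FEATURES_21264))
```

With an empty feature set the model simply returns the mask unchanged, which is the behavior described for the emulation on early implementations.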


Privileged Architecture Library 


Many architectures have been designed for the possibility of running multiple operating 
systems, but a few also have certain features embedded in the hardware which one par- 
ticular operating system required at the time. For example, certain instructions and 
other features of the VAX architecture were so closely associated with the VMS operat- 
ing system that “VAX/VMS” came to be used as an adjective describing either the hard- 
ware or the software, before the latter was officially renamed OpenVMS. 


The Alpha architecture was explicitly designed at a level of abstraction which 
would not prejudice or foreclose future developments either in hardware implemen- 
tations or in operating systems over a projected architectural lifetime of at least two 
decades. 


This decision led to a clear distinction between the ordinary instruction set— 
which any program (e.g., user-specified or system-supplied) running in any mode (e.g., 
user mode or kernel mode) with any operating system (e.g., OpenVMS, Unix, Windows 
NT) would use—and everything else. All other requirements for particular operating 
systems, interrupt processing, I/O device mapping and driver control, shared memory 
access, multiprocessor coordination, and the like would be relegated to a privileged 
architecture library (PAL) of support routines. This abstraction would thus permit such 
routines to evolve without recomplicating the fundamental RISC design, i.e., reduced 
complexity, of the Alpha. 


The contents of a privileged architecture library are collectively called PALcode. 
When those instructions execute, the processor is said to be operating in PALmode. 

Call_pal Instruction


A special instruction format (PAL format) is specified in the Alpha architecture for 
making entry into PALcode (Figure 4.1). The call_pal instruction consists of the 6- 
bit opcode 00 followed by a function field in bits <25:0>. Ordinarily only the lowest 8 
bits are used, with bit 7 being set to 1 if the PALcode instruction is unprivileged or 0 if 
it is privileged. 
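
The layout just described is simple enough to model directly. The following Python sketch is ours and assumes the opcode occupies the six highest bits, <31:26>, of the 32-bit instruction, as in the other Alpha formats; the function codes used in the example are those listed for halt and bpt in Table 13.8.

```python
CALL_PAL_OPCODE = 0x00
FUNCTION_MASK = (1 << 26) - 1


def encode_call_pal(function: int) -> int:
    """Build the 32-bit instruction word: opcode in <31:26>, function in <25:0>."""
    if not 0 <= function <= FUNCTION_MASK:
        raise ValueError("function code must fit in 26 bits")
    return (CALL_PAL_OPCODE << 26) | function


def decode(word: int):
    """Return (opcode, function) for a 32-bit instruction word."""
    return (word >> 26) & 0x3F, word & FUNCTION_MASK


def is_unprivileged(function: int) -> bool:
    """Bit 7 of the function code distinguishes unprivileged calls."""
    return bool(function & 0x80)


HALT, BPT = 0x0000, 0x0080
print(hex(encode_call_pal(BPT)), is_unprivileged(BPT), is_unprivileged(HALT))
```

Because the opcode is zero, the instruction word for call_pal is numerically equal to its function code.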

When a PALcode instruction comes along in the instruction stream, any previous 
instructions underway in the execution unit pipelines are permitted to complete so long 
as they do not raise exceptions. Then the PALcode instruction is issued, resulting in a 
transfer of control to the function dispatch within the privileged architecture library. 

For the most part, PALcode is written using the regular Alpha instruction set, but a 
few additional opcodes are set aside for present or future implementations of PALcode. 
Those are named pal19, pal1B, pal1D, pal1E, and pal1F. The last two charac- 
ters of these instruction mnemonics match the corresponding hexadecimal value for the 
Alpha opcode.

In addition, we mention here for completeness that mnemonics opc01 through 
opc07 and the corresponding Alpha 6-bit opcodes 01 through 07 are reserved for any 
future architectural extensions. Opcodes 0A, 0C, 0D, 0E, 14, and 1C were formerly
reserved but now have designated uses in conjunction with the extensions already discussed
in this chapter.


PALmode Environment 


Control passes to the PALmode environment either under software control, i.e., through 
executing a call_pal instruction, or upon the occurrence of an exception (like over- 
flow) or a hardware interrupt. In PALmode, further interrupts are disabled, any implemen- 
tation-specific hardware functions are enabled, and all functions of the machine can be 
controlled. Operating systems thus invariably appropriate PALmode as part of the kernel. 

PALcode has to consist of appropriately intertwined components that reflect the 
characteristics or requirements of the operating system, the chip implementation, and 
the rest of a complete system (i.e., memory, bus, and device control structures). Conse- 
quently, it is not possible to discuss PALcode and PALmode very thoroughly in isola- 
tion from numerous specificities that lie outside the scope of this book. 


Universal PALcode Instructions 


A few PALcode instructions are required of every Alpha implementation, though some
of those may have effects (including no CPU action at all) that vary with the operating
system (Table 13.8).


Table 13.8 Required PALcode Instructions

PALcode      Function
Mnemonic     Code        Type            Purpose

halt         0000        Privileged      Halt the processor; then possibly restart or take
                                         other action based upon operating system specific
                                         circumstances.

draina       0002        Privileged      Wait for deliberate aborts such as testing for
                                         nonexistent memory to be completed.

imb          0086        Unprivileged    Make the instruction stream coherent with the
                                         data stream.

bpt*         0080        Unprivileged    Used by debuggers to put breakpoints in a program.

bugchk*      0081        Unprivileged    Used for reporting errors, e.g., those that are
                                         probably hardware-related.

cserve*      0009        Privileged      Used for implementation-dependent purposes.

gentrap*     00AA        Unprivileged    Used for reporting error conditions detected by
                                         software.

rdunique*    009E        Unprivileged    Read unique context value.

swppal*      000A        Privileged      Transfer control to a different PALcode, as
                                         during a bootstrap process.

wrunique*    009F        Unprivileged    Write unique context value.

* An operating system is required to recognize these function codes even if it does not need to use them (as
is the case with Windows NT for bugchk, cserve, rdunique, and wrunique).


The imb instruction ensures that any modification to the instruction stream, i.e.,
any instruction longword that has just been stored, will be read correctly. 


Since we have extensively used debuggers, we briefly note here how an instruc- 
tion such as bpt can work. When a program is being run under the control of the 
debugger, it is generally not being run through an interpreter or emulator as you may 
have imagined. Instead, a breakpoint is inserted by replacing an actual instruction with 
call_pal bpt and by saving the displaced instruction and its address in a data struc- 
ture maintained elsewhere by the debugger. 


When bpt is encountered during the flow of execution, control passes by excep- 
tion into PALcode. As with all exceptions (and interrupts), the program counter value 
from the interrupted instruction sequence is used both as a clue about the necessary 
actions to be taken and as the ultimate address for resumption of normal computing. If 
you decide to continue execution from a breakpoint, the debugger restores the displaced 
instruction and adjusts the return program counter value so that the displaced instruc- 
tion and those following it will be executed. 
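
The bookkeeping involved can be sketched in a few lines. This Python model is purely illustrative: `ToyDebugger` is our own name, the program is a list of 32-bit instruction longwords indexed by position rather than by byte address, the other instruction values are arbitrary placeholders, and the exception path through PALcode is not modeled at all.

```python
CALL_PAL_BPT = 0x00000080  # call_pal with the bpt function code


class ToyDebugger:
    """Tracks displaced instructions, as a real debugger's data structure would."""

    def __init__(self, code):
        self.code = code   # instruction longwords of the program being debugged
        self.saved = {}    # position -> displaced instruction

    def set_breakpoint(self, position):
        """Save the real instruction and put call_pal bpt in its place."""
        self.saved[position] = self.code[position]
        self.code[position] = CALL_PAL_BPT

    def resume_from(self, position):
        """Restore the displaced instruction; execution resumes at it."""
        self.code[position] = self.saved.pop(position)
        return position


program = [0x11111111, 0x22222222, 0x33333333]  # placeholder longwords
dbg = ToyDebugger(program)
dbg.set_breakpoint(1)
print(hex(program[1]))
print(dbg.resume_from(1), hex(program[1]))
```

A debugger that keeps the breakpoint active after continuing would have to reinsert call_pal bpt once the displaced instruction has executed; the sketch omits that step.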


Other capabilities of a debugger such as handling watchpoints and tracepoints are 
too specialized to be explained in this book. 

The rdunique and wrunique PALcode instructions can be used to assist with
“threads” of execution, a concept not discussed in this book. The Unix assembler recognizes
these as PAL_rduniq and PAL_wruniq, while the OpenVMS assembler recognizes
them as read_unq and write_unq.


PALcode Specific to Operating Systems 


As previously mentioned, OpenVMS took advantage of several distinctive features of 
the VAX architecture which helped account for the fact that OpenVMS was not ported 
to other hardware architectures lacking them. Those features included a group of 
machine instructions for inserting and removing elements from longword or quadword 
queues; those are provided as a corresponding group of unprivileged Alpha PALcode 
instructions. 

OpenVMS also requires several specialized registers that were implemented in 
hardware on the VAX. On an Alpha running OpenVMS, numerous privileged instruc- 
tions in the PALcode library for OpenVMS can read from and write to particular mem- 
ory locations designated to model the behavior of such registers. 

Similarly, the PALcode libraries for Unix and Windows NT contain appropriate 
privileged instructions to read from and write to particular memory locations designated 
to model critical data structures required by kernel code. Additional PALcode instruc- 
tions handle returns from interrupt processing and other typical functions of an operat- 
ing system kernel in ways that best take advantage of the Alpha architecture. 

Since PALcode executes with interrupts disabled, most PALcode routines are 
actually quite short. At contemporary processor speeds, the risk of missing interrupts 
from clocks or spinning disks does not pose the formidable challenge of former years. 
Nevertheless, interrupt-handling routines are still kept as short and fast as possible. 

The Unix assembler does not automatically recognize the symbolic PALcode 
mnemonics. An include file is used in order to preclude the need to look up each hexa- 
decimal function code: 


#include <machine/pal.h> 
call_pal PAL_NAME 


where NAME is the PALcode mnemonic desired from Table 13.8 or any of the Unix-spe- 
cific instructions. The OpenVMS assembler does recognize the PALcode mnemonics 
directly. 
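
To make the relationship between a mnemonic and its function code concrete, here is a small Python sketch in which a lookup table plays the role that the include file plays for the Unix assembler. The function codes are taken from Table 13.8 (plus halt, whose code is zero); the names PAL_FUNCTION_CODES and encode_call_pal are our own inventions for illustration. The sketch assumes the usual call_pal layout, with opcode 00 in the six high-order bits of the instruction longword and the function code in the remaining 26 bits.

```python
# Function codes (hexadecimal) as listed in Table 13.8; halt is code zero.
PAL_FUNCTION_CODES = {
    "halt": 0x0000,
    "cserve": 0x0009,
    "swppal": 0x000A,
    "rdunique": 0x009E,
    "wrunique": 0x009F,
    "gentrap": 0x00AA,
}

CALL_PAL_OPCODE = 0x00          # six-bit opcode field, bits 31..26


def encode_call_pal(function_code):
    """Return the 32-bit instruction longword for call_pal with this function code."""
    if not 0 <= function_code < (1 << 26):
        raise ValueError("function code must fit in 26 bits")
    return (CALL_PAL_OPCODE << 26) | function_code


gentrap_word = encode_call_pal(PAL_FUNCTION_CODES["gentrap"])
halt_word = encode_call_pal(PAL_FUNCTION_CODES["halt"])
```

Because the opcode field is zero, the instruction longword is numerically equal to the function code itself.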
Summary 


The concept of an architecture, apart from implementations, has progressively pervaded 
the computer industry since the introduction of the IBM 360 family. As the number of 
independent makers and truly different architectures has decreased over time, the 
lifespan of successful architecture families has tended to lengthen. Over periods of a 
decade or more, extensions to an architecture simply become inevitable, in order to 
accommodate different technology, different applications, or different markets. 


The Alpha architecture changes through selective additions of instruction subsets 
into initially sparse opcode and function code address spaces. Accompanying changes 
in compilers and in operating system kernels ensure both forward and backward com- 
patibility. The most specialized aspects of hardware designs and of operating systems 
are localized in PALcode and thus have a deliberately restricted visibility to the assem- 
bly language programmer, i.e., to the prototypical reader of this book. 
EXERCISES 


13.1 Why do hardware makers extend an already successful architecture? 

13.2 What are the drawbacks of extending an architecture, and what are some of the ways 
in which compatibility can be assured? 

13.3 What are the competing demands for transistors on a chip that must be kept in some 
balance in new implementations of an architecture? 

13.4 (Small research project) Read about the rpcc instruction in the Alpha Architecture 
Reference Manual. Use that instruction in order to see whether you can detect cycle 
timing differences in assembly language fragments adapted from some of the 
optimized and unoptimized compiler output shown in Chapter 12 or from examples of 
your own choosing. 

13.5 The Alpha assemblers recognize sextl as a pseudo-operation that results in the 
instruction addl R31, Rb, Rc. Review the addl instruction and explain how this 
produces the desired effect. 

13.6 Define the parity of a 64-bit value as 0 if the total number of 1 bits is even or as 1 if 
the total number of 1 bits is odd. Write a concise sequence that computes in register 
R0 the parity of the value in register R16: (a) using only the basic Alpha instruction 
set; or (b) using additional instructions from the extensions if helpful. 

13.7 What is unique about the hexadecimal representation of call_pal halt? For 
extra fun, find out the opcode used for the halt instruction in some other 
architecture(s). 

13.8 Speculate whether the unprivileged imb PALcode instruction might have a role to 
play in the hypothetical situation where a program modifies (or creates) its own 
instructions and then executes them. 

13.9 (Project) Research another computer architecture, either past or present, and report 
on the expansion of its instruction set over time, the rationale for modifications, and 
the ways in which potential adverse impact of those changes upon compatibility for 
extant or new software was mitigated. 

13.10 (Strongly recommended) Incorporate all of the additional Alpha instructions from 
this chapter into your personal summary chart(s). 





Suggested Resources 


In the interest of clarity as well as helpfulness to readers, 
we outline here some concise information about system configurations suitable for 
those teaching or learning about the material covered in this book, as well as a few 
extensions to related matters. Nothing here should be construed as an endorsement of 
specific products by the authors or the publisher, however. 


System Hardware 


Quite modest hardware configurations will support a class of undergraduates, i.e., either 
a dedicated small system, local to a computer science department, or a general purpose 
server with network access for individual interactive accounts. 

There are numerous remarketers of new and used hardware made by Digital 
Equipment Corporation, a large majority of whom belong to the DDA — The Associa- 
tion of the DEC Marketplace: 


http://www.dda.org 


Learning about the Alpha need not occur on a brand-new system, by any means. 
Ordinary individual user accounts with no special privileges are entirely suitable 
for the programming exercises suggested in this book. 
Unix 
Testing of Unix programs for this book occurred primarily via local area network access 
(telnet and ftp) on a 125-MHz Alpha workstation based on the 21064 chip. Similar 
access for students is to take place on a 500-MHz Alpha workstation based on the 
21164A chip. 


OpenVMS 


Testing of OpenVMS programs for this book occurred via local area network access 
(LAT, telnet, and ftp) on an Alpha server system with dual 275-MHz processors based 
on the 21164 chip, and previously on an Alpha server system based on the 21064 chip 
that permitted access from individual student accounts. 

OpenVMS does not by default include TCP/IP applications like telnet and ftp. 
OpenVMS systems can be made network-accessible with, for example, MultiNet (Pro- 
cess Software Corporation). 


System Software for Programming Environments 


We developed the examples in this book using computer systems that were already set 
up and being utilized for other institutional purposes. No other special modifications 
were necessary. 


Unix 
In addition to the default C compiler and Alpha assembler (cc command), we are using 
the FORTRAN compiler (f77 command) and the Pascal compiler (pc command) for 
student assignments and projects. 

In our particular institutional setting, it was undesirable to use the standard shell 
for Digital Unix, which lacks the convenience of OpenVMS-like command replay with 
the up-arrow or other niceties that are offered by tcsh: 


http://www.primate.wisc.edu/software/csh-tcsh-book 


Persons coming to Unix from experience with OpenVMS systems will, however, soon 
notice that tcsh does not do anything to provide command replay within interactive 
application programs, such as dbx. 

Well-formatted current Unix documentation may be conveniently accessed over 
the Internet through the URL: 
http://www.unix.digital.com/faqs/publications/pub_page/doc_list.html 


Nearly the entire Digital Unix documentation set and man reference pages are included 
through this URL. 


OpenVMS 


The MACRO-64 assembler is not an integral part of the OpenVMS system software, 
but rather an add-on licensed software product. In addition to the MACRO-64 Alpha 
assembler, we have used the compilers for BASIC, C, COBOL, FORTRAN, and Pascal 
for the sorts of demonstrations and illustrations shown in this book. 

We became aware of two non-fatal problems with the OpenVMS MACRO-64 
assembler (V1.1-087), both involving symbolic labels, while developing program 
examples for this book. First, the symbolic debugger does not seem to be able to “see” 
labels defined with a single colon. Thus we have had to define key symbolic addresses 
with a double colon in the OpenVMS variants of sample programs for this book when 
those addresses were likely to be used for breakpoints or locations for examination. 
Second, and much more seriously, temporary labels like 10$ are supposed to be 
bounded strictly to a scope between the nearest two labels with names beginning with a 
letter of the alphabet. That is, a branch instruction and its temporary target label 
are not supposed to lie on opposite sides of a regular label. Because of 
this latter problem, we have only sparingly illustrated the use of temporary labels in our 
sample programs. (We include a dummy test program, BADSYMB.M64, on the CD- 
ROM that accompanies this book, with which those current behaviors could be 
rechecked if MACRO-64 is ever re-released by the vendor.) 


Desktop Client Access Software 


Formerly, the text editing phase of software development would typically take place 
using host-based character-oriented editors like vi or emacs (Unix) and EDT or TPU 
(OpenVMS). The ascendancy of readily networked personal computers has brought the 
additional choice of editing programs on one computer that are intended for another. 

Here we suggest a few useful shareware or freeware tools for your consideration. 
(One other alternative, especially for Unix, would be Xwindows access software.) 
Windows 9x/NT 


WordPad and the FTP client supplied with Windows 9x/NT system software are gener- 
ally satisfactory for editing ASCII program files and moving them to another computer 
system, respectively. 

The freeware Tera Term client program, however, deserves to be better known for 
telnet access from Windows client workstations to other computers: 


web: http://spook.vector.co.jp/authors/VA002416/teraterm.html 

server: ftp.riken.go.jp 

path: /pub/pc/misc/terminal/teraterm/historical 

file: readme.txt 

path: /pub/pc/misc/terminal/teraterm 

files: ttermp23.zip (Windows 95/NT) 
ttermv14.zip (Windows 3.1) 

Macintosh 


Suggested freeware or shareware client programs for editing ASCII program files on a 
Macintosh computer (BBEdit Lite), moving them to another system over a network 
using ftp (Fetch), and establishing a telnet session (BetterTelnet) are as follows: 


Web: http://www.barebones.com 

server: ftp.barebones.com 

path: /pub/freeware 

file: BBEdit_Lite_4.1.sit.bin 

Web: http://www.dartmouth.edu/pages/softdev/fetch.html 

server: ftp.dartmouth.edu 

path: /pub/software/mac 

files: Fetch_3.0_User_Guide.sit.hqx 
Fetch_3.0.3.hqx 

Web: http://www.cstone.net/~rbraun/net/telnet/ 

server: ftp.cstone.net 

path: /users/rbraun/mac/telnet 

files: telmanual.sit.bin 
telnet68k.sit.bin 
telnetfat.sit.bin 
telnetppc.sit.bin 


Windows NT on Alpha Systems 


Windows NT (initially still as a 32-bit operating system) runs in either the worksta- 
tion or the server configuration on Alpha hardware with the appropriate PALcode. 
Two frequently updated Web sites, both independent of vendors, are of potential 
interest: 


http://AlphaNT.com 
http://users.otenet.gr/~snatcher/alphacentral.html 


Linux on Alpha Systems 


For this book we have used system software and compilers licensed from Digital Equip- 
ment Corporation. The Linux operating system has also been ported to the Alpha micro- 
processor. A selection of sites, current at the time of preparation of this book, is as follows: 


http://www.redhat.com (one vendor of Linux for Alpha) 
comp.os.linux.alpha (news group) 
axp-list-request@redhat.com (subscription to a discussion list) 
ftp://gatekeeper.dec.com/pub/DEC/Linux-Alpha/ (FTP site) 


Hardware-Level Information 


Readers who wish to extend their study of Alpha microprocessors down into the hard- 
ware organization level may find downloadable documentation through the URL: 


http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html 


For example, you can obtain the Digital Semiconductor 21164 Alpha Microprocessor 
Hardware Reference Manual intended for system designers and programmers. 

The Alpha architecture is designed by Digital Equipment Corporation, but several 
companies are licensed to manufacture Alpha microprocessor chips: 


http://www.digital.com/semiconductor/ 
http://www.samsungsemi.com/ 
http://www.mitsubishichips.com/ 


A considerable amount of historical and comparative information about many dif- 
ferent microprocessors can be found at: 


http://www.microprocessor.sscc.ru/ 
http://www.cs.uregina.ca/~bayko/cpu.html 


which have links to numerous other sites. 
University Video Communications provides video lectures and tutorials featuring 
the leaders in computing and information technology: 


http://www.uvc.com/ 


Titles on Alpha and other RISC architectures are available. 

Cray Research has produced a line of supercomputers using MPP design tech- 
niques (“massively parallel processors”) incorporating 16 to 2048 Alpha microproces- 
sors in CRAY T3D™ and CRAY T3E™ systems. 


http://www.cray.com/products/ 


For compatibility with its customary version of a Unix-like operating system, Cray has 
made use of the capability of Alpha chips to be set to the big-endian mode for data rep- 
resentation. 





Answers and Hints for Selected 
Exercises 


We have included several types of exercises in our 
book. Some exercises have specific, often numeric answers; we give answers for most 
of those here. Other exercises are designed primarily to stimulate thought or imagina- 
tion, perhaps best taking the form of a short-essay response; we give additional hints or 
rhetorical questions for some but not all such exercises here. Still other exercises 
require the writing of actual programs which are usually short and which frequently can 
usefully build upon the model programs that are explained fully in the text; we do not 
present worked-out programs, but we do provide occasional suggestions beyond those 
contained within the exercises themselves. 


CHAPTER 1 


1.1. What has made it possible for you to ride more than one type of bicycle? 
1.2. Consider the shape of the steering wheel versus the style of seats, for instance. 
1.3. Consider the characteristics of the keyboards. 


1.4. Why do automobile manufacturers build different automobiles? Are the reasons 
the same for brands of ice cream? 


1.5. Can different operating systems run on the same computer? Does this represent 
any form of standardization? 


1.6. Does it matter what executes the 680x0 instructions? 
1.9. 16, 8. 
1.10. 0 to 4095 (unsigned). 
1.11. 155, 525; 205064, 68148; FFFFFEAB, FFFEF5CC. 

1.12. a. 1100100, 144, 64; b. 4, 4, 4; c. 1000000, 64, 40; d. 100000000, 256, 400. 

1.13. a. 1FF; b. B02; c. 100; d. 10FFE. 

1.14. Two ways to convert from one base to another are subtraction of powers 
(repeatedly subtracting the largest possible value of the form a*b^n) and the division 
method (repeatedly dividing the number by the new base and collecting the 
remainders). Either one can be applied here. 
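
As an illustration of the division method mentioned in the hint for 1.14, the following Python sketch converts a non-negative integer to its digit string in another base. The function name to_base and the digit table are our own choices; the test value 100 (decimal) is the one whose binary, octal, and hexadecimal forms are listed in answer 1.12a.

```python
DIGITS = "0123456789ABCDEF"


def to_base(value, base):
    """Convert a non-negative integer to a digit string by repeated division."""
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)   # remainder is the next digit,
        digits.append(DIGITS[remainder])         # least significant first
    return "".join(reversed(digits))


binary_form = to_base(100, 2)
octal_form = to_base(100, 8)
hex_form = to_base(100, 16)
```

The remainders emerge least significant digit first, which is why the list is reversed before being joined.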

CHAPTER 2 

2.1. What conflicts would result if these principal structures were visible? 

2.2. How many different buses are there on a typical “IBM-compatible” PC? Does the 
same software run on these machines with different buses? 

ets OTs 

2.4. Is a human memory addressed by storage locations or by contents? 

25: 160m 1d, 

2.6. At least 17 bits. 

2.7. Play the role of the computer, following the process of fetching and executing an 
instruction. 

2.8. Every instruction could contain the address of the next instruction. 

2.9. 8 terabytes. 

2.10. There are four information units and as many as nine data types for the Alpha dis- 
cussed in this chapter. 

2.11. Consider, by analogy, whether the decimal numbers 5 x 10? and 5.00 x 10 express 
the same thing. 

2.12. 40200000; 40266666. 

2.13. Are the representations the same as for exercise 2.12? 

2.14. 274-1. 

2.15. Use the table provided. 

2.16. 124F. 

2.17. Can you deduce what the /byte qualifier does? Are there other qualifiers that 
might be of interest? 

CHAPTER 3 

3.1. Consider both a very simple statement like A = B and a slightly more complex 
statement like A = -B. 

3.2. Can a statement lack a label and/or operands? 

3.3. Examine all opcodes and directives. 
3.4. Can you, for example, write ^ĉa^a^? 
3.5. Disambiguation, ease of change, etc. 
3.6. a. 101010, 052, 0x2A; b. ^b101010, ^o52, ^x2A. 
3.7. a. 0x9C, 0x80, 0x03, 0x29, 0x53, 0x04; b. ^x9C, ^x80, ^x03, ^x29, ^x53, ^x02. 
3.8. a. 53 (decimal); b. 15 (decimal). 
3.9. 30000: 0000 0000 0000 0066 
30008: 0000 0000 0000 000C 
30010: 0000 0000 0000 0022 
here = 0x30000 (^x30000), relocatable; there = 0x68 (^x68), absolute; this = 
0x30008 (^x30008), relocatable; thing = 0xC8 (^xC8), absolute. 
3.11. What do you know about the “structure” of a text file? 
3.12. Consider that values for symbols may be found in any of the input object files or in 
system object libraries. 
CHAPTER 4 
4.1. Five for the Unix version of the program. 
4.3. Can you use sub in place of add and vice versa? 
4.4. a. 16; b. 64. 
4.5. Bit 6 (value 0x40). 
4.6. Bit 1 (value 2). 
4.7. The calculation looks like this: 
      -2  ==>    110  ==>    110 
     - 3  ==>  - 011  ==>  + 101 
                 ---         --- 
                 011         011 
And we see that the sign of the result is different from the signs of the two 
values being added (columns at the right). 
4.8. As a 3-bit analogy, consider the product of +3 and -2: 011 * 110 = 010 010 as a 
6-bit wide product (that is, 010 is the apparent result from the umulh instruction). 
Since one operand is negative, we must subtract the other operand from this 
apparent intermediate result, 010 - 011 = 111. When the adjusted high-order result 
(111) and the low-order (010) result are placed together, as 111 010, they form the 
correct two’s complement representation of -6. 
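
The correction described in answer 4.8 can be tried out for any register width. The Python sketch below is a model only: the function names umulh and signed_mulh and the bits parameter are our own, chosen so that the 3-bit analogy and the 64-bit case can share one routine. It computes the unsigned high-order half of a product and then subtracts one operand for each other operand that is negative.

```python
def umulh(a, b, bits=64):
    """High-order half of the unsigned product of two bits-wide values."""
    mask = (1 << bits) - 1
    return (((a & mask) * (b & mask)) >> bits) & mask


def signed_mulh(a, b, bits=64):
    """High-order half of the signed product, derived from umulh by correction."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    a &= mask
    b &= mask
    high = umulh(a, b, bits)
    if a & sign_bit:        # a is negative: subtract the other operand
        high -= b
    if b & sign_bit:        # b is negative: subtract the other operand
        high -= a
    return high & mask


apparent_high = umulh(0b011, 0b110, bits=3)        # +3 times -2, as in the text
adjusted_high = signed_mulh(0b011, 0b110, bits=3)
low = (0b011 * 0b110) & 0b111
```

With the 3-bit operands of the answer, the apparent high-order result is 010, the adjusted result is 111, and the low-order result is 010.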
4.9. Expression parsing is an exact science, but there are many choices of assumptions 
and precedence conventions which could be made. 
4.10. a. FFFF FFFF FFFF FFFF; b. FFFF FFFF 0000 0000 
4.11. The lda range is -32768 to +32767 by steps of one, while the ldah range is 
-32768 * 65536 to +32767 * 65536 by steps of 65536. 
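
The two ranges in answer 4.11 combine naturally when a constant is built from an ldah/lda pair. The Python sketch below shows one way to split a value into the two signed 16-bit displacements; the function name split_for_ldah_lda is hypothetical. Because the lda displacement is sign-extended, the high part must be adjusted upward whenever the low 16 bits have their top bit set.

```python
def split_for_ldah_lda(value):
    """Split value into (high, low) so that high * 65536 + low == value,
    with both parts in the signed 16-bit range used by ldah and lda."""
    low = value & 0xFFFF
    if low >= 0x8000:               # lda sign-extends its displacement,
        low -= 0x10000              # so treat the low half as negative
    high = (value - low) >> 16      # compensate in the ldah part
    if not -32768 <= high <= 32767:
        raise ValueError("value cannot be reached with one ldah/lda pair")
    return high, low


simple = split_for_ldah_lda(0x12345678)
adjusted = split_for_ldah_lda(0x1234ABCD)
```

The second example shows the adjustment: the high part comes out as 0x1235 rather than 0x1234, and the low part is negative.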
4.12. This will permit use of the full displacement range with each base register, 
resulting in a span of the largest possible memory region. 
4.13. a. 0000 0000 0123 4567; b. FFFF FFFF 89AB CDEF; c. 0000 0000 0123 4567; 
d. FFFF FFFF 89AB CDEF 

4.14. a. 0000 0000 FFFF FFFF; b. FFFF FFFF 7654 3210; c. FFFF FFFF FFFF FFFF; d. 
FFFF FFFF FFFF FFFF 

4.15. Size of an address for the respective architectures. 

4.16. Reverse the order of the instructions for the examples given in the chapter for 
(Rx)+ and @(Rx)+ so that the decrementation is done before the indirect 
reference to memory. 

4.17. The code would be similar to that in Figure 4.4, but the mulq instruction would be 
replaced by subq, and there would be one more iteration. 

4.18. The s4addl instruction would allow each new vector element address to be easily 
calculated. 

4.19. 3N, 5N, 7N, 9N. 

4.20. Subtract from zero (fastest); multiply by minus one (slower). 

CHAPTER 5 

5.1. The actual displacement field is 21 bits, and there are two bits implied to be zero at 
the least significant end, for a total of 23 bits. Since 2^23 is ~8 million, one might 
think this would mean ~4 million in either direction. But each instruction is a 
longword! 
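
The arithmetic behind answer 5.1 can be checked with a short Python sketch. It assumes the usual Alpha rule that the branch target is the updated program counter (the address of the branch plus 4) plus four times the sign-extended 21-bit displacement; the function name branch_target and the sample addresses are ours.

```python
def branch_target(pc, disp_field):
    """Target address of a branch at address pc with a 21-bit displacement field."""
    disp = disp_field & 0x1FFFFF
    if disp & 0x100000:             # sign bit of the 21-bit field
        disp -= 0x200000            # sign-extend to a negative count
    return (pc + 4) + 4 * disp      # displacement counts longwords, not bytes


fall_through = branch_target(0x1000, 0)             # next instruction
branch_to_self = branch_target(0x1000, 0x1FFFFF)    # displacement of -1
farthest_forward = branch_target(0, 0x0FFFFF) - 4   # byte offset from updated PC
farthest_backward = branch_target(0, 0x100000) - 4
```

The reach is about 4 million bytes in either direction, which corresponds to only about 1 million instructions because each instruction occupies a longword.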

5.2. Infinite loop. 

5.3. What is the seemingly strange behavior (side-effect) of the unconditional branch 
instruction? 

5.4. Yes. 

5.5. a. Stinginess?; b. Equal is equal, independent of interpretation of the bit pattern. 

5.6. Consider the program in Figure 5.1 and substitute a subtract operation for the mul- 
tiply operation. Are there any other modifications that would be needed? 

5.8.  mov     $4,$0       # Assume R4 holds the larger value 
      cmplt   $4,$3,$5    # If R4 holds the smaller value, 
      cmovlbs $5,$3,$0    # then transfer the value in R3 instead 

5.9. Nothing. 

5.10. What test is always true for register R31? 

5.11. Save the contents of R14 when cmovlbs is actually executed. 

5.12. Begin from MAXIMUM as a model program, but look instead for a minimum 
value. 

5.13. and, 00204470; bis, 9ABCDEF8; xor, 88888888; ornot, 7777777F; bic, 00000008; 
eqv, 77777777. 

5.14. Interchange. 


5.15. Consult the information in Table 5.4. 
5.16. Think about XOR. 
5.18. Can you multiply the quotient by 10 and subtract from the original? 


CHAPTER 6 


6.1. Result would always be zero. 

6.2. (...E+3) mod 8=1. 

6.3. Use the debugger to show the output. 
6.4. You may need to shift both right and left. 
6.5. Use the debugger to show the output. 


6.6. Not very helpful, both because cmpbge compares bytes separately in parallel and 
because the comparisons are unsigned, not signed. 
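
The two properties cited in answer 6.6 are easy to see in a model of the instruction. The Python function below is a sketch of cmpbge as we understand it: each of the eight byte positions is compared independently and without regard to sign, and bit i of the result is set when byte i of the first operand is greater than or equal to byte i of the second. The Python name cmpbge simply mirrors the mnemonic.

```python
def cmpbge(a, b):
    """Model of cmpbge: eight independent unsigned byte comparisons."""
    result = 0
    for i in range(8):
        byte_a = (a >> (8 * i)) & 0xFF
        byte_b = (b >> (8 * i)) & 0xFF
        if byte_a >= byte_b:        # unsigned comparison of one byte lane
            result |= 1 << i
    return result


all_equal = cmpbge(0, 0)
unsigned_view = cmpbge(0x80, 0x7F)   # 0x80 is "larger" only when unsigned
reverse_view = cmpbge(0x7F, 0x80)
```

A byte such as 0x80 would be negative as a signed quantity, yet it compares as greater than 0x7F here, and no information is carried from one byte lane to the next.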


6.7. The first character, in this case a carriage return, is “printed” by the IO_C program 
followed by a newline. In this case, the carriage return itself is echoed as a newline 
also. 


6.8. Why register R16? 
6.9. Find the jsr instruction and decode the Ra and Rb fields in it. 


6.10. Boolean result of two separate comparisons; loading, modifying, and storing a 
quadword of data. 


6.13. Remember that the lowest three bits in register Rb provide the s value. 


CHAPTER 7 


7.1. Reverse the signed displacements for the lda instructions. Two conventions are 
possible, making the surface element be addressed as 0(SP) or as -8(SP). 


7.2. Zero would be a valid digit value, but a negative value could be used as the flag and 
be easily tested. 


7.3. Register R9 serves as the user stack. Do you only need to fix up instructions refer- 
encing register R9? 

7.5. A no-op. 

7.6. An infinite loop. 

7.7. Consider two carefully constructed range checks and five specific comparisons. 

7.8. Consider global/local and addressing overhead. 

7.9. a. None; b. no contents from before calls are guaranteed to be there after calls. 

7.10. 255 byte capacity (OpenVMS); no argument information provided for Unix calls. 

7.11. a. None; b. 255; c. 2% - 1 (in theory). 

7.14. One change would be to store the returned value in register R18, not R11. 
7.15. Hints for Unix: just over π x 10^7 seconds per year; largest positive signed 32-bit 
integer (Table 2.1). Hints for OpenVMS: just over π x 10^16 nanoseconds per year; 
largest positive signed 64-bit integer (Table 2.1). 
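
The hints for 7.15 can be turned into a quick estimate. The Python sketch below assumes, as the hints suggest, that the question is how many years a signed 32-bit count of seconds or a signed 64-bit count of nanoseconds can span; the variable names and the choice of a 365.25-day year are ours.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60       # 31,557,600
NANOSECONDS_PER_YEAR = SECONDS_PER_YEAR * 1e9

MAX_SIGNED_32 = 2**31 - 1
MAX_SIGNED_64 = 2**63 - 1

years_in_32_bit_seconds = MAX_SIGNED_32 / SECONDS_PER_YEAR
years_in_64_bit_nanoseconds = MAX_SIGNED_64 / NANOSECONDS_PER_YEAR

# The "just over pi" rule of thumb from the hint:
ratio_to_pi = SECONDS_PER_YEAR / (math.pi * 1e7)
```

The first quotient is a little over 68 years and the second a little over 292 years.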
CHAPTER 8 

8.1. Corresponding mathematical operations have the same numbers of operands, simi- 
lar branching choices would be desirable, etc.; it makes sense to structure the 
instructions to be parallel, even if they cannot share much actual digital logic. 

8.2. No bits are lost when either 32- or 64-bit data are moved between memory and 
floating-point registers as long as lds is always paired with sts and ldt with 
stt. Since integer registers are heavily used for many purposes, floating-point 
registers may be more freely available. 

8.3. Consider by analogy adding 0.03 three times onto 7.1; stepwise, with rounding, the 
result would be 7.1 considering the implied uncertainty beyond the first decimal 
place. Now consider adding 0.03 + 0.03 + 0.03 = 0.1, onto which 7.1 is added for a 
result of 7.2. Now can you do the binary case? 
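
The decimal analogy in answer 8.3 can be reproduced exactly with Python's decimal module. In this sketch the helper round1, which rounds to one decimal place after every operation, is our own stand-in for the limited precision of a floating-point format.

```python
from decimal import Decimal, ROUND_HALF_UP


def round1(x):
    """Round to one decimal place, imitating a format with limited precision."""
    return x.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Stepwise: each small addend is lost when the sum is rounded.
stepwise = Decimal("7.1")
for _ in range(3):
    stepwise = round1(stepwise + Decimal("0.03"))

# Grouped: the small addends are combined first and survive rounding.
small_sum = round1(Decimal("0.03") * 3)
grouped = round1(Decimal("7.1") + small_sum)
```

The stepwise total stays at 7.1, while the grouped total reaches 7.2, so the order of additions matters once rounding intervenes.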

8.4. The absolute spacing between successive numbers jumps by a factor of two every 
time the binary exponent has to be incremented by one. 
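
The doubling described in answer 8.4 can be observed directly for IEEE double-precision (T_floating) values. The Python sketch below reinterprets the bits of a value as an integer, adds one, and converts back; the helper names next_up and spacing are ours, and the sketch is valid only for positive finite values.

```python
import struct


def next_up(x):
    """Next representable double above a positive finite x."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits + 1))[0]


def spacing(x):
    """Gap between x and the next larger representable double."""
    return next_up(x) - x


gaps = [(x, spacing(x)) for x in (1.0, 1.5, 2.0, 4.0, 8.0)]
```

Values that share a binary exponent, such as 1.0 and 1.5, have the same spacing; each step to the next exponent doubles it.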

8.6. cpyse F31,F31,Fx. 

8.7. What is the precision of the binary fraction? 

CHAPTER 9 

9.1. Look at second order differences. 

9.2.  .repeat 25 
      .blkq 5 
      .blkb 40 
      .endr 

9.4. Preset register R1 and change the instruction at ahead to indicate a vowel was 
found. 

9.6. a. .macro addl2 REG1,REG2 
        addq REG1,REG2,REG2 
        .endm addl2 

        .macro addl3 REG1,REG2,REG3 
        addq REG1,REG2,REG3 
        .endm addl3 

9.7. a. subq R31,Rx,Ry; b. eqv Rx,R31,Ry (XORNOT); c. Refer back to 
division by 10 in Chapter 5. 

9.8. Multiply the count from .narg by 8 to compute the byte displacement for the 
lda instruction. Use a dummy variable initialized to zero and incremented by 8 
after its use in the repeat block. 

9.9. a. subl R1,R2,R2; b. mulq R1,R2,R2; c. addq R1,R2,R2. 
9.10. a. ldq r0,BaseData; b. ldq R0,Baseé. 

9.11. a. It is important not to destroy any other register in the process. Using XOR can 
solve this difficulty, since Ra XOR Rb XOR Ra = Rb, etc. 
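
The identity quoted in answer 9.11 leads to the familiar three-step exchange. The Python sketch below assumes that the exercise involves interchanging the contents of two registers without using a third; the function name xor_swap and the sample values are ours, with two variables standing in for registers Ra and Rb.

```python
def xor_swap(ra, rb):
    """Exchange two values using only XOR, with no scratch register."""
    ra ^= rb        # ra now holds Ra XOR Rb
    rb ^= ra        # rb = Rb XOR (Ra XOR Rb) = original Ra
    ra ^= rb        # ra = (Ra XOR Rb) XOR Ra = original Rb
    return ra, rb


swapped = xor_swap(0x1234, 0xABCD)
```

Each step corresponds to a single xor instruction whose destination is one of its own source registers.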


CHAPTER 10 


10.3. Every space will count as one word. Two possible modification schemes are to 
peek at the next character (this can be tricky to do right) or to remember the last 
character seen (more straightforward). 

10.4. Yes, the function scanf could match spaces between words. 

10.5. If the first string is ONE and the second is ON, that test would not trigger the 
necessary swap to move ON before ONE. 

10.7. Ten traversals, each moving 8 bytes; 9*8 + 10 = 82, where there are 8 instructions 
per traversal (10 for the final traversal). 

10.8. Perform a range check (A-Z) on a scratch copy of each character, and then force 
those copies either to upper case or to lower case before the equivalent of the 
instruction at why. Doing that will make like words be adjacent but not always in 
the same kind of ordered relationship from instance to instance. Why not? 

10.9. The long dashes. Anything else? 

10.14. Begin by converting all characters to lower case. 

10.15. Since you will be using the C functions for I/O, you could pretest your ideas for 
logic to handle the various EOF situations by writing a C program first. 


CHAPTER 11 


11.1. It has made it easier for a compiler to generate optimum code than a human. 

11.2. ta = 020" bran POLO ™ bian = 03 * Taga 029 ~ Epa 

11.3. 12.2 ns. The generalized formula is 
t_avg = h_1 * t_1 + (1 - h_1) * h_2 * t_2 + ... 
+ (1 - h_1) * (1 - h_2) * ... * (1 - h_{n-1}) * t_n 

11.4. Fourteen (not counting the external call of the routine). 

11.6. An inductive proof is appropriate here. Assume the stated number of passages to 
be true for F_{n-2} and for F_{n-1}. Since we compute F_n = F_{n-2} + F_{n-1}, the 
number of full passages is 1 + (F_{n-2} - 1) + (F_{n-1} - 1) = (F_{n-2} + F_{n-1}) - 1 
= F_n - 1 and the number of abbreviated passages is F_{n-2} + F_{n-1} = F_n. The 
original assumption can be anchored by observing its correctness for n = 1, 2, and 3. 

11.9. Be sure that you are not fooled by idiosyncrasies of the formatting of very large 
integers by the decimal print routines in the high-level language. 

11.11. Because performance is not visible to the programmer interface. 
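
The generalized formula in answer 11.3 is easy to evaluate mechanically. The Python sketch below is our reading of it: each level of the memory hierarchy contributes its access time weighted by the probability that all faster levels missed and this level hit, and the final level always satisfies the request. The function name average_access_time and the sample figures in the example are hypothetical and are not the data of the exercise.

```python
def average_access_time(hit_rates, times):
    """Average access time for a hierarchy of len(times) levels.

    hit_rates holds the hit rate of every level except the last, which
    is assumed always to hold the data.
    """
    if len(times) != len(hit_rates) + 1:
        raise ValueError("need one more access time than hit rates")
    total = 0.0
    miss_probability = 1.0              # probability that all faster levels missed
    for hit_rate, time in zip(hit_rates, times):
        total += miss_probability * hit_rate * time
        miss_probability *= 1.0 - hit_rate
    return total + miss_probability * times[-1]


# Hypothetical two-level example: a 10 ns cache with a 90% hit rate
# backed by a 100 ns main memory.
example = average_access_time([0.9], [10.0, 100.0])
```

For the two-level example, the result is 0.9 * 10 + 0.1 * 100 = 19 ns.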
CHAPTER 12 


12.1. Except for branching and procedure calls, instructions are needed according to a 
uniformly ascending sequence of addresses. Data may be accessed in many ways: 
pseudo-randomly, in ascending address sequence, in descending sequence, etc. 

12.2. Where does Pascal keep i and c? 

12.3. Notice the common CMPLx instructions followed by the BNE instruction. 

12.4. There seems to be no good reason why a shift-related instruction could not be 
used. 

12.6. Changing from a tight loop to linear code; replacing the call with actual code; to 
schedule means to reorder to minimize data stalls. 

12.7. Some compilers may warn about uninitialized variables or about computed 
quantities that are never subsequently used. 


CHAPTER 13 


13.1. What are the disadvantages of offering customers a new architecture? 

13.2. Take the Intel x86 architecture as an example. 

13.3. Over time, what features have moved from outside the chip to inside? Cache and 
floating-point instructions are two examples. What others can you cite? 

13.5. The longword arithmetic operate instructions always produce a sign-extended 
result (which could then be used in later quadword arithmetic operate instructions). 

13.6. a. Could you use a blbs instruction and a shift instruction in a loop that runs up to 
64 times? Is it worthwhile to use a second branch or conditional move instruction 
to quit the loop early if there are only a few bits? b. Could you use the ctpop 
instruction? 

13.7. Totally zero, a popular value. 

13.8. Self-modifying code is perilous, but still occasionally encountered. 
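
The two approaches hinted at for exercise 13.6 can be compared in a high-level model before any assembly language is written. The Python sketch below is an illustration only; the function names are ours. The first function tests the low bit and shifts, quitting early once no 1 bits remain, and the second takes the low bit of a population count, much as a ctpop-based sequence would.

```python
MASK64 = (1 << 64) - 1


def parity_by_loop(value):
    """Parity by testing the low bit and shifting right (at most 64 passes)."""
    value &= MASK64
    parity = 0
    while value:                # quit early once no 1 bits remain
        parity ^= value & 1     # toggle when the low bit is set
        value >>= 1
    return parity


def parity_by_popcount(value):
    """Parity as the low bit of the count of 1 bits."""
    return bin(value & MASK64).count("1") & 1


samples = [0, 1, 0b1011, 0xFF, MASK64, 1 << 63]
results = [(parity_by_loop(v), parity_by_popcount(v)) for v in samples]
```

Both functions agree on every sample, which is a useful check to repeat against the assembly language versions.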


Index 


! character, logical OR operator (OpenVMS), 60 
" characters 
enclosing string data, 57 
enclosing string parameters (OpenVMS), 272 
# character 
beginning comments (Unix), 51 
literal operand (OpenVMS), 100-101 
$ character 
beginning a system-defined macro (OpenVMS), 54 
ending a temporary label (OpenVMS), 122-123 
in symbols, 56 
& character, logical AND operator, 60 
' character, string concatenation (OpenVMS), 274 
+ character 
addition operator, 60 
unary positive operator, 61 
() characters, clarifying precedence (Unix), 61 
* character 
in comment delimiter /* . . . */ (Unix), 51 
multiplication operator, 60 
, character, separating specifiers, 51 
- character 
continuing a line (OpenVMS), 51 
introducing a flag (Unix), 47 
subtraction operator, 60 
unary negation operator, 61 
. character 
introducing a directive, 54 
symbol for location counter, 59 
in symbols, 56 
/ character 
in comment delimiter /* . . . */ (Unix), 51 
division operator, 60 
examine command (Unix), 69 
search command (Unix), 69 
: character 
double, terminating a label (OpenVMS), 51, 122 
terminating a label, 51, 122 
; character, beginning comments (OpenVMS), 51 
<> characters 
clarifying precedence (OpenVMS), 61 
enclosing arguments (OpenVMS), 227 
enclosing string parameters (OpenVMS), 272 
? character, generating a label (OpenVMS), 275 
\ character 
decimal operator for parameter substitution (OpenVMS), 273-274 
exclusive OR operator (OpenVMS), 60 
^ character 
exclusive OR operator (Unix), 60 
part of unary operators (OpenVMS), 55, 61, 272-273 
_ character in symbols, 56 
| character, logical OR operator (Unix), 60 
~ character, binary complement operator (Unix), 60 
0 prefix for octal, 55 
0x prefix for hexadecimal, 55 


21064 chip, 383-384 
21064A chip, 383-384 
21066 chip, 383-384 
21068 chip, 383-384 
21164 chip, 383-384 
21164A chip, 383-384 
21164PC chip, 383-384 
21264 chip, 383-384 
680x0 CPU, 107 

8086 CPU, 20 

8088 CPU, 20 

80x86 as a CISC architecture, 6 


A 


a access mode, 307 
/a qualifier (OpenVMS), 228 
ABS attribute (OpenVMS), 280 
Absolute section (OpenVMS), 280 
Absolute symbols, 63 
Access mode, 307 
Access time, memory, 323 
Accumulator register, 27 
Actual parameters (OpenVMS), 259, 268 
addl instruction, 52, 83-85
addq instruction, 52, 83-85 
Address 
space, 23, 383 
transfer (OpenVMS), 62 
.address directive (OpenVMS), 57, 59-60, 194-195 
Addresses 
comparing as unsigned values, 117—118 
number of, 27—29 
symbolic, 51—52 
Addressing modes, 87-90, 105-108
autodecrement, 105—107 
autodecrement deferred, 105—107 
autoincrement, 105—108 
autoincrement deferred, 105—108 
branch, 114-115, 119 
direct, 83, 88, 106-107
displacement, 88-89 
immediate, 87 
indirect, 88 
literal, 83, 87-88 
memory direct, 88 
memory indirect, 88 
performance issues, 323 
register direct, 88, 106-107 
register indirect, 88, 105—107 
register indirect deferred, 106—108 
relative, 107 
relative deferred, 107 
adds instruction, 240-241 
addt instruction, 240-241 
.align directive, 53, 189-190, 279 
Alignment
of floating-point data, 34-35 
of integer data, 97 
for pipeline efficiency, 344 
of program sections, 279 
all keyword, 370 
Allocation, storage, 56—58 
Alpha architecture, 6, 29-30, 80, 379-381 
basic, 381-382 
byte/word extension, 384—386 
count extension, 388-389 
motion video extension, 386—388 
square root and floating-point register move extension, 389-390
/alpha qualifier (OpenVMS), 47 
Alphanumeric characters, 36-38 
ALU, 128 
amask instruction, 391—392 
Ampersand character, logical AND operator, 60 
and instruction, 130 
Angle brackets 
clarifying precedence (OpenVMS), 61 
enclosing macro arguments (OpenVMS), 227 
enclosing string parameters (OpenVMS), 272 
Answers for selected exercises, 405 
Antidependency, 343 
Apostrophe character, string concatenation (OpenVMS), 
274 
APPROXPI program, 250-254 
-arch flag (Unix), 369 
Architecture 
80x86 as a CISC example, 6 
Alpha as a RISC example, 6 
CISC, 6 
defined, 1 
extensions of, 369-370, 379-381 
load/store, 28, 29-30 
memory-memory, 29 
one-address, 27—28 
PDP-11, 5—6, 29, 80 
of the piano, 2 
register-register, 29 
RISC, 6 
stack-based, 27 
superscalar, 29-30, 343 
three-address, 28 
two-address, 28 
VAX, 29, 80 
VAX as a CISC example, 6 
zero-address, 27 
/architecture qualifier (OpenVMS), 369 
args parameter (OpenVMS), 165, 227-228 
Argument information register (OpenVMS), 200-201 
Argument passing, 200—203 
for BASIC (OpenVMS), 202 
for C (OpenVMS), 202 
for C (Unix), 202 
for COBOL (OpenVMS), 202 
by descriptor, 201 
for FORTRAN (OpenVMS), 202 
by immediate value, 201 
methods, 201-202 
for Pascal (OpenVMS), 202 
by reference, 201 
Arguments for conditional assembly (OpenVMS), 262-263
Arithmetic and logic unit, 128 
Arithmetic instructions, 25 
floating-point, 240-243 
integer, 83 
Arithmetic overflow 
floating-point, 242-243 
integer, 86 
Arithmetic shift, 132—133 
ASCII characters, 36-38 
ASCII codes, table of, 37 
.ascic directive (OpenVMS), 202 
.ascid directive (OpenVMS), 202 
.ascii directive, 57
/ascii qualifier, 69 
.asciiz directive (Unix), 57 
.asciz directive (OpenVMS), 57 
Assembler, symbolic, 54 
Assembler directives, 53 
Assembler optimizations, 335-341 
Assemblers, 3, 43 
error detection by, 64—65 
functions of, 64—65 
two-pass, 64—64 
Assembly language, 3-4, 43-44 
advantages, 4—5 
architectural dependence of, 3 
disadvantages, 4, 11, 44 
lack of portability of, 44 
operators, 52-53 
statement types, 50 
Assembly process, 64—65 
assign command (Unix), 69 
Asterisk character 
in comment delimiter /* . . . */ (Unix), 51
multiplication operator, 60 
Attributes of program sections (OpenVMS), 279-281 
Autodecrement addressing, 105—107 
Autodecrement deferred addressing, 105—107 
Autoincrement addressing, 105—108 
Autoincrement deferred addressing, 106-108 
automatic keyword (OpenVMS), 370 
b suffix, temporary labels (Unix), 122
^b unary operator (OpenVMS), 55
Backslash character 
decimal operator for parameter substitution (OpenVMS), 273-274
exclusive OR operator (OpenVMS), 60 
BACKWARD program, 169-171 
BADSYMB program (OpenVMS), 401 
Base addressing, 88-89 
.base directive (OpenVMS), 53, 89 
Base for number systems, 12 
binary, 12 
decimal, 12 
octal, 12 
hexadecimal, 12 
BASIC, argument passing, 202 
Basic Alpha architecture, 381-382 
beq instruction, 115, 116 
bge instruction, 115, 116
bgt instruction, 115, 116 
bic instruction, 130-131 
Big-endian convention, 31, 160-163 
Binary files, 317 
Binary operators 
arithmetic, 60 
logical, 60 
binary option (OpenVMS), 278 
bis instruction, 130-131 
bit instruction (PDP-11 and VAX), 129-130 
ble instruction, 115, 116 
blbc instruction, 115, 116
blbs instruction, 115, 116 
.blkb directive (OpenVMS), 57 
.blkl directive (OpenVMS), 53, 57
.blkq directive (OpenVMS), 53, 57
.blkw directive (OpenVMS), 57 
Block size of a cache, 344 
blt instruction, 115, 116 
bne instruction, 115, 116 
Bits, numbering of, 31 
Boolean condition, 116 
Boolean functions, 128—129 
Bounds checking, 357 
bpt (PALcode), 394 
br instruction, 52, 115, 118-119 
Branch addressing, 115, 119 
Branch class of instructions, 81 
Branch displacement, 115, 119 
Branch instructions 
floating-point, 244-245 
integer, 114-116 
pipeline behavior, 345 
subroutine call, 192—193 
subroutine return, 192—193 
time penalty, 123-124 
Branch prediction, 343, 345 
Breakpoints, 68-74, 394-395 
bsr instruction, 187-188 
Bubble sort 
Integers (SORTINT), 313-317 
Strings (SORTSTR), 300-304 
Bubbles in a pipeline, 343, 373
bugchk (PALcode), 394 
Bus, 20 
BYTE alignment (OpenVMS), 279 
.byte directive, 57 
Byte-addressable memory, 23 
Byte-manipulation instructions, 144 
Byte/word extension, 384—386 
Bytes, numbering of, 31 
/c display mode (Unix), 69
^c unary operator (OpenVMS), 61
C language 
argument passing, 202 
compiler output, 358-359 
library functions for I/O, 293 
Cache 
block size, 344 
hit ratio, 323 
line size, 344

misses, effect on pipelining, 347 

subsystem, 21—23, 384 
Call frame, 326-327 
$call macro (OpenVMS), 165-166, 227-229 
call_pal instruction, 381, 393 
Calling sequence 

OpenVMS, 205-206

Unix, 211
cancel command (OpenVMS), 70 
Caret character 

exclusive OR operator (Unix), 60 

part of unary operators (OpenVMS), 55, 61 
Case insensitivity (OpenVMS), 47 
Case sensitivity (Unix), 47 
Case structures, 193—195 
cc command (Unix), 47 
CDC 6600, 354 
Central processing unit, 20-21 
Characters 

ASCII codes, 37 

for binary constants, 55 

byte storage for, 38 

control, 36 

for decimal constants, 55 

for hexadecimal constants, 55 

for octal constants, 55 
/check qualifier (OpenVMS), 357 
chrget routine, 164-165 
chrput routine, 164-165 
Circumflex character 

exclusive OR operator (Unix), 60 

part of unary operators (OpenVMS), 55, 61, 272-273 
CISC architecture, 6 
CISC instructions, power of, 323-324 
clock function (C), 333-334 
Closed routines, 266 
cmoveq instruction, 124—125 
cmovge instruction, 124—125 
cmovgt instruction, 124-125 
cmovlbc instruction, 124—125 
cmovlbs instruction, 124-125 
cmovle instruction, 124—125 
cmovlt instruction, 124-125
cmovne instruction, 124—125 
cmpbge instruction, 171-173 
cmpeq instruction, 116-117 
cmple instruction, 116-117 
cmplt instruction, 116-117 
cmpteq instruction, 245 
cmptle instruction, 245 
cmptlt instruction, 245
cmptun instruction, 245—246 
cmpule instruction, 116-118 
cmpult instruction, 116-118 
cnd option (OpenVMS), 278 
COBOL, argument passing, 202 
$code_section macro (OpenVMS), 53
Codes 

operation, 53 

pseudo-operation, 53 
Colon character 

double, terminating a label (OpenVMS), 51, 122 

terminating a label, 51, 122 
COM_C program, 356
COM_F program, 356 
COM_P program, 356 
.comm directive (Unix), 53 
Comma character, separating specifiers, 51 
Comment field, 51 
importance of, 76 
Commercial instruction set (PDP-11), 379-380
COMMON block (FORTRAN), 280 
Comparative instructions, 25 
floating-point, 245-246 
signed integer, 116-117 
unsigned integer, 116-118 
Compare byte instruction, 171-173 
Comparing addresses, 117—118 
Comparing text files, 74-75 
Compilers, 4, 43-44, 353-354 
Computer architecture defined, 1 
Computer languages, 3—4 
Computer structures, 20 
central processing unit, 20-21 
input and output devices, 20, 24 
memory, 20, 21-23 
CON attribute (OpenVMS), 280 
Concatenated section (OpenVMS), 280 
Concatenation of strings (OpenVMS), 274 
Conditional assembly, 261—266 
Conditional branch instructions 
floating-point, 244—245 
integer, 116 
Conditional move instructions 
floating-point, 246-247 
integer, 124—126 
conditionals option (OpenVMS), 278 
Constants, 54-55 
ASCII, 55 
binary, 55 
decimal, 55 
hexadecimal, 55 
loading into registers, 95-96 
octal, 55 
Context preservation, 203—205 
conti command (Unix), 70 
Continuation, line (OpenVMS), 51 
Control characters, 36 
Control instructions, 25, 114 
Control statements, 50, 61—62 
Control string for formatted I/O, 296, 308-309 
Control structures, 113 
Control-Y (OpenVMS), 10 
Conventions for register use, 197—199 
Conversion instructions, 248—250, 390 
Copy sign instructions, 247—248 
cord utility (Unix), 374 
Coroutines, 196-197 
Counter, location, 54, 58—60, 62-63 
. character as symbol for, 59 
as an offset, 59 
incrementation of, 58—59 
multiple instances, 59 
CPU, 20-21 
cpys instruction, 248 
cpyse instruction, 248 
cpysn instruction, 248 
cserve (PALcode), 394
ctlz instruction, 388-389 
cttz instruction, 388-389 
ctpop instruction, 388-389 
cvtlq instruction, 249 
cvtql instruction, 249 
cvtqs instruction, 249 
cvtqt instruction, 249
cvtst instruction, 249 
cvttq instruction, 249 
cvtts instruction, 249 
/D display mode (Unix), 69
^d unary operator (OpenVMS), 55 
Data addressing 
OpenVMS, 100-101 
Unix, 98—100 
Data conversion instructions, 248—250, 390 
Data dependencies, 343 
.data directive (Unix), 53 
Data movement instruction, 25 
Data section (OpenVMS), 207—208 
Data section pointer (OpenVMS), 207 
$data_section macro (OpenVMS), 53
data_section_pointer parameter (OpenVMS), 
209 
Data stalls, 344 
Data types 
alphanumeric characters, 36-38 
floating-point numbers, 33—36 
integers, 32 
logical, 128-131 
Datapath, 80 
dbx debugger (Unix), 68-74, 98—100 
/dd display mode (Unix), 69 
/debug qualifier (OpenVMS), 47 
Debuggers, 68-74 
bpt (PALcode), 394-395 
capabilities of, 68-71 
examples of commands 
breakpoint, 71, 72 
tracepoint, 73 
watchpoint, 73-74 
OpenVMS example, 72 
and optimization, 374 
Unix example, 71-72 
/dec qualifier (OpenVMS), 69-70 
Declarative statements, 50 
DECNUM program, 137-139 
DECNUM2 program, 183-185 
delete command (Unix), 70 
Delimiters for string parameters (OpenVMS), 272-273 
Denormalized IEEE numbers, 236 
deposit command (OpenVMS), 69-70 
Descriptor 
passing argument by, 202 
procedure, 205-206
string (OpenVMS), 202-203 
Device drivers, 24 
diff command (Unix), 74 
differences command (OpenVMS), 74 
Direct addressing, 83, 88, 106-107
Direct assignment, 55
Directives, assembler, 53 
Directories, 289 
Displacement addressing, 88-89, 239-240
display command (OpenVMS), 74 
Display modes for dbx (Unix), 68-74 
Display output, 293—295 
Division 
floating-point, 240-241 
integer, 87, 135-137, 214-216 
by ten, 135-137 
by zero, floating-point, 242—243 
divl pseudo-instruction (Unix), 215-216 
divlu pseudo-instruction (Unix), 215-216 
divq pseudo-instruction (Unix), 215-216 
divqu pseudo-instruction (Unix), 215-216 
divs instruction, 240-241 
divt instruction, 240-241 
DMA controllers, 24 
do clause (OpenVMS), 74 
Dollar sign character 
beginning a system-defined macro (OpenVMS), 54 
ending a temporary label (OpenVMS), 122-123 
in symbols, 56 
Dot character 
introducing a directive, 54 
symbol for location counter, 59 
in symbols, 56 
DOT_3 program, 102-105 
DOT_N program, 119-122 
$dp symbol (OpenVMS), 65 
draina (PALcode), 394 
$ds symbol (OpenVMS), 65 
Dynamic optimization, 373-374 
Dynamic rounding, 242 
ecb instruction, 381
Editors, use of, 46 
Effective address, 88, 90, 93, 94 
Efficiency of loop design, 123—124 
.else directive (OpenVMS), 264-265 
Encapsulation, 163—165 
.end directive, 53

specifying transfer address (OpenVMS), 62, 220 
End of file, 309, 312 
$end_routine macro (OpenVMS), 53, 211
.endc directive (OpenVMS), 262-263 
Endian conventions 

big endian, 31, 160-163 

little endian, 31, 160-163 
.endm directive (OpenVMS), 267 
.endr directive (OpenVMS), 258-259 
.ent directive (Unix), 53, 211 
EOF condition, 309, 312 
Epilogue, standard 

for OpenVMS, 210 

for Unix, 211 
eqv instruction, 130-131 
Error detection, 312, 317 
ev4 keyword, 369 
ev5 keyword, 369 
ev56 keyword, 369 
ev6 keyword, 369
examine command 
DCL (OpenVMS), 7, 10-11 
debugger (OpenVMS), 69-70 
excb instruction, 381 
Exceptions, 242-243 
Exclamation character, logical OR operator (OpenVMS), 
60 


EXE attribute (OpenVMS), 280 

.exe file type (OpenVMS), 46 
Executable image file, 43 

Executable section (OpenVMS), 280 
Exit code (Unix), 211 

expansions option (OpenVMS), 278 
Exponent, 33—36 

Expressions, 60-61 

extbl instruction, 144-146 
Extensions to an architecture, 379-381 
EXTERNAL keyword (Pascal), 219
extlh instruction, 144-146 

extll instruction, 144-146

extqh instruction, 144-146 

extql instruction, 144-146 

Extract byte instruction, 144-146, 161-162 
extwh instruction, 144-146 

extwl instruction, 144-146 
f suffix, temporary labels (Unix), 122
f77 command (Unix), 400
fabs pseudo-instruction, 248 
fbeq instruction, 244—245 
fbge instruction, 244—245 
fbgt instruction, 244—245 
fble instruction, 244—245 
fblt instruction, 244—245 
fbne instruction, 244—245 
fclose function (C), 305, 306-307 
fcmoveq instruction, 246—247 
fcmovge instruction, 246-247 
fcmovgt instruction, 246-247 
fcmovle instruction, 246-247 
fcmovlt instruction, 246-247 
fcmovne instruction, 246-247 
Feature size, 383 
fetch instruction, 381 
fetch_m instruction, 381 
fgets function (C), 305, 307-308 
FIB1 function, 327—329 
FIB2 function, 330-331 
FIB3 function, 331-332
Fibonacci numbers, 325-335 
Fields in assembly language statements, 50-51 
comments, 51 
labels, 51 
operators, 51 
specifiers, 51 
File name, 306 
File pointer, 306-307 
File storage 
logical, 288 
physical, 288 
File systems, 288, 289-290 
File types, default naming of

.exe (OpenVMS), 46 

.lis (OpenVMS), 47 

.m64 (OpenVMS), 46 

.map (OpenVMS), 47 

.obj (OpenVMS), 46 

.s (Unix), 46 
Flags for Unix commands, 47 
/float qualifier (OpenVMS), 357 
Floating-point control register (FPCR), 242, 243 
Floating-point instructions 

compared to integer instructions, 234-235 

operate group, 81 

register move, 390 
Floating-point numbers 

alignment desirable for, 34-35 

double precision, 33-36 

IEEE special values, 234-236 

IEEE standards, 33—36 

memory representation, 34—35 

register representation, 34-35 

S_floating, 35-36

single-precision, 33-36 

T_floating, 34 

VAX-compatible, 33 
Floating-point operate instructions, 81 
.fmask directive (Unix), 211 
fmov pseudo-instruction, 248 
fneg pseudo-instruction, 248 
fnop pseudo-instruction, 247, 338 
fopen function (C), 305, 306-307 
Formal parameter (OpenVMS), 259, 267 
Format 

Alpha instructions, 81-83 

assembly language statements, 50-51 
Format control string for I/O, 296, 308-309 
Formatted I/O, 308-309 
FORTRAN language 

argument passing, 202 

compiler output, 358-359 
Forwarding, 343, 346 
FPCR (floating-point control register), 242, 243 
fprintf function (C), 305, 308-309 
fputs function (C), 305, 307-308 
Fraction, 33 
.frame directive (Unix), 53, 211 
Frame pointer 

for OpenVMS, 212-213 

for Unix, 213-214 
fscanf function (C), 305, 308-309 
ftois instruction, 389-390
ftoit instruction, 389-390
FTP program, use of, 46, 402 
Full-screen mode (OpenVMS), 74 
Function code field 

integer instructions, 84—85 

floating-point instructions, 243 
Futurebus+, 384 
-g flag (Unix), 47
-g3 flag (Unix), 336 
GEM compilers, 355 
generic keyword, 369
gentrap (PALcode), 394 
getchar function (C), 164-165 
gets function (C), 294, 295 
gettimeofday function (Unix), 226 
Global pointer (Unix), 98—99, 208 
Global symbols, 53, 63, 122 
.globl directive (Unix), 53, 122 
go command (OpenVMS), 70 
goto instructions, 118 

$gp register, 98, 208

gp register, 98—99 
halt (PALcode), 394

Hazards, pipeline, 343-344 
data dependencies, 343 
procedural dependencies, 343 
resource conflicts, 343 

/hexadecimal qualifier (OpenVMS), 11 

Hidden bit, 34-36 

High-level languages, 43-44 
portability of, 44 
standardization of, 44

Hints for exercises, 405 

host keyword, 369 

Hyphen character 
continuing a line (OpenVMS), 51 
introducing a flag (Unix), 47 
subtraction operator, 60 
unary negation operator, 61 
/i display mode (Unix), 69
-i flag (Unix), 71 
IBM 360, 1, 5 
IBM 801, 354 
IBM PC AT, 234 
Identifier (Unix), 56 
IEEE floating-point numbers 
denormal, 236 
infinity, 236
NaN (not a number), 236 
S_floating representation, 35-36
T_floating representation, 33 
X_floating representation, 382 
zero, 236 
.if directive (OpenVMS), 262-264 
.if_false directive (OpenVMS), 265 
.if_true directive (OpenVMS), 265 
.if_true_false directive (OpenVMS), 265 
.iif directive (OpenVMS), 265-266 
imb (PALcode), 394 
Immediate IF (OpenVMS), 265-266 
Imperative statements, 50 
Implementation 
defined, 1—2 
of the piano, 2-3 
implver instruction, 391-392 
In-line functions, 324, 370-373 
#include directive (Unix), 190-191 
.include directive (OpenVMS), 190-191 


include option (OpenVMS), 278 
Indefinite repeat block (OpenVMS), 259-261 
Indexed files (OpenVMS), 292 
Indirect addressing, 89 
Inexact result, floating-point, 242-243 
Infinity as IEEE number, 236 
Information units, 21—23, 30-32 

address, 21 

bytes, 30-32 

contents, 21 

longwords, 30-32 

quadwords, 30—32 

size, 21, 30 

words, 30-32 
-inline flag (Unix), 370 
inline option (OpenVMS), 370 
INLINE program, 371 
Input/output system, 24 
insbl instruction, 152-154 
Insert byte instruction, 152-154, 162-163 
inslh instruction, 152-154 
insll instruction, 152-154 
insqh instruction, 152-154 
insql instruction, 152-154 
/instr qualifier (OpenVMS), 69-70 
Instruction 

issue, 383 

pipelining, 341-349 

power, 323-324 

rewriting, 336-339 

size, 322-323 

tuning, 368—369 
Instruction architectures 

load/store, 28—29 

one-address, 27-28 

stack-based, 27 

three-address, 28 

two-address, 28 

zero-address, 27 
Instruction categories 

arithmetic, 25, 83 

comparative, 25 

control, 25 

data movement, 25 

load and store, 93—102 

logical, 25 

pseudo-, 30 
Instruction components 

operand specifiers, 25 

operation code (opcode), 25 
Instruction execution cycle, 25-26 
Instruction formats 

branch class, 81—82 

load and store class, 81-82 

operate class, 81-82 

PALcode class, 81—82 
Instruction subsets 

basic, 381—382 

bit count, 388-389 

byte and word, 384-386 

floating point move, 389-390 

motion video, 386-388 

square root, 389-390 
inswh instruction, 152—154 
inswl instruction, 152-154
int data type (C), 165 
Integer division, 214-216 
Integer operate instructions, 81 
Integer sizes 
byte, 32 
longword, 32 
quadword, 32 
word, 32 
Integers, 13-16 
representations of, 13—16, 32 
signed, 15—16, 32 
unsigned, 32 
Interface, user-visible, 1 
<ints.h> header file (OpenVMS), 223 
Invisible registers, 382 
IO_C program (OpenVMS), 165-166 
IO_C program (Unix), 167-169 
.irp directive (OpenVMS), 259-260 
.irpc directive (OpenVMS), 260-261 
ISAM files, 292 
Issue, multiple, 347—348 
itofs instruction, 389-390 
itoft instruction, 389-390 


J 


jmp instruction, 187, 192-193 

jsr instruction, 167-168, 187, 192-193 
jsr_coroutine instruction, 187, 196-197 
Jump instructions, 187, 192-193 

Jump tables, 193—195 
Keyboard input, 293-295
Keyword parameters (OpenVMS), 270-272 
Keywords 
for -arch and /architecture, 369 
for -tune and /optimize=tune, 369 
kind parameter (OpenVMS), 209 
/l qualifier (OpenVMS), 228
Label field, 51 
Labels, 52 
global symbols, 122 
local symbols, 122 
macro-generated (OpenVMS), 275-276 
temporary symbols, 122-123 
Languages, computer 
1GL, 2GL, 3GL, 4GL, 3-4
ANSI C, 4
artificial intelligence, 4 
assembler, 3 
database access, 4 
high-level, 4 
machine, 3 
natural language, 4 
Pascal, 4 
Last-in first-out stack, 106 
Latency of Alpha instructions, 346? 
LC, location counter, 62—63 
LCL attribute (OpenVMS), 280
lda instruction, 93, 93—95, 95—96 
ldah instruction, 93, 93-95, 95—96 
ldbu instruction, 385 
ldgp pseudo-instruction, 52 
ldl instruction, 52, 93, 96-97
ldl_l instruction, 93
ldq instruction, 52, 93, 96-97
ldq_l instruction, 93
ldq_u instruction, 93, 97-98, 147, 338 
lds instruction, 236-237 
ldt instruction, 236-237 
ldwu instruction, 385
Length of string parameter (OpenVMS), 273 
Libraries in linking process, 47 
library option (OpenVMS), 278 
LIFO stack, 106, 137 
Line continuation (OpenVMS), 51 
%line keyword (OpenVMS), 70 
Line numbers, in listing file, 62—63 
Line size of a cache, 344 
Line terminators in text files, 308 
link command (OpenVMS), 11, 47 
Link map, 66—67, 100 
Linkage section (OpenVMS), 205, 207 
Linkers, 43, 65 
functions of, 65—67 
Linux operating system, 403 
.lis file type (OpenVMS), 47 
.list directive (OpenVMS), 277 
/list qualifier (OpenVMS), 11, 47 
Listing file, 11, 47, 54, 62-63, 65 
controlling, 277-278 
line numbers, 62—63 
location counter, 62—63 
symbol table, 63 
Listing level (OpenVMS), 278 
.lit8 directive (Unix), 92, 98-99
Literal addressing, 83, 87 
Little-endian convention, 31, 160—163 
Load address instructions, 93, 93-95, 95-96 
Load instructions 
byte and word, 384-386 
floating-point, 236-237 
integer, 93, 96—98 
Load and store class of instructions, 81-82, 93—102 
Load/store architecture, 28—29 
Local label block (OpenVMS), 123 
local parameter (OpenVMS), 227 
Local section (OpenVMS), 280 
Local symbols, 122-123 
Local variables, 212, 11C 
Locality, 122-123 
Location counter, 54, 58—60 
. character as symbol for, 59 
as an offset, 59 
in listing file, 62-63 
incrementation of, 58—59 
multiple instances, 59 
program sections, 279 
Logical data, 128-131 
Logical difference, 130-131 
Logical equivalence, 130-131 
Logical functions, 128-132
binary, 129 
unary, 129 
Logical instructions, 25, 130-132 
Logical mask, 131 
Logical product, 130 
Logical shift, 132 
Logical sum, 130-131 
LONG alignment (OpenVMS), 279 
.long directive, 57
/long qualifier (OpenVMS), 69
Longword, 11 
Loop design and efficiency, 123-124 
Loop unrolling, 324, 364 
Lower case usage, 47-48 
Lower case, converting to upper, 131 
$ls symbol, 65 
.m64 file type (OpenVMS), 10, 46
-machine_code flag (Unix), 366 
/machine_code qualifier (OpenVMS), 357 
<machine/pal.h> header file (Unix), 395 
macro command (OpenVMS), 11, 47 
.macro directive (OpenVMS), 267-268 
MACRO-64 assembler, 46 
Macros (OpenVMS), 53, 266-275 
actual parameters, 268 
default values, 270-272 
defining, 267—268 
formal parameters, 267 
invoking, 268-269 
keyword parameters, 270-272 
names, 267—268 
positional parameters, 269-270 
recursive, 276 
self-redefining, 276 
string parameters, 272-275 
Maintainability, writing for, 75-76 
manual keyword, 370 
Map file, 47, 66—67 
.map file type (OpenVMS), 11, 47 
/map qualifier (OpenVMS), 11, 47 
MAP statement (BASIC), 281 
Mask, logical, 131 
Mask byte instruction, 154-156, 163 
.mask directive (Unix), 167—168, 211 
Masking 
ext instruction, 146, 161-162 
ins instruction, 153, 162—163 
msk instruction, 155, 163 
MAX function, 125 
MAXIMUM program, 126-128 
maxsb8 instruction, 387—388 
maxsw4 instruction, 387—388 
maxub8 instruction, 387—388 
maxuw4 instruction, 387—388 
mb instruction, 381 
.mdelete directive (OpenVMS), 276 
me option (OpenVMS), 278 
meb option (OpenVMS), 278 
Memory, 21-23 
access time, 323 
as an array, 22
byte-addressable, 23 
holding instructions and data, 21 
information units, 21—23 
word size, 23 
Memory direct addressing, 88, 106-107 
Memory indirect addressing, 88, 107 
Memory-memory architecture, 29 
.mexit directive (OpenVMS), 268, 276-277
mf_fpcr instruction, 381 
minsb8 instruction, 387—388 
minsw4 instruction, 387—388 
minub8 instruction, 387-388 
Minus sign character 
continuing a line (OpenVMS), 51 
introducing a flag (Unix), 47 
subtraction operator, 60 
unary negation operator, 61 
minuw4 instruction, 387-388 
MIPS compilers, 354-355 
MIX attribute (OpenVMS), 280 
Mixed section (OpenVMS), 280 
Module name, 62 
MONEY macro, 281-283 
Motion video extension, 386-388 
Motorola 680x0, 107, 234 
mov pseudo-instruction, 52, 95—96, 99, 101 
mskbl instruction, 154-156
msklh instruction, 154-156
mskll instruction, 154-156
mskqh instruction, 154—156 
mskql instruction, 154-156 
mskwh instruction, 154-156 
mskwl instruction, 154-156
mt_fpcr instruction, 381 
mull instruction, 83-85 
mulq instruction, 83-85 
muls instruction, 240-241 
mult instruction, 240-241 
Multiple-issue effects, 347—348 
Multiplication 
by a constant, 348-349 
extended precision, 86 
floating-point, 240-241 
integer, 83-85 
unsigned, 86 
Multiply-defined symbols, 64, 122 
name parameter (OpenVMS), 209, 227
NaN (not a number), 236 
NAND function, 129 
.narg directive (OpenVMS), 270
Natural alignment, 97-98 
.nchr directive (OpenVMS), 273 
Nesting 
of angle brackets (OpenVMS), 61 
of macros (OpenVMS), 268 
of parentheses (Unix), 61 
.nlist directive (OpenVMS), 277 
nm command (Unix), 66, 98 
No-op instructions, 338 
NOEXE attribute (OpenVMS), 280 
noinline keyword (OpenVMS), 370
NOMIX attribute (OpenVMS), 280 
-non_shared flag (Unix), 341 
none keyword, 370 
/nooptimize qualifier (OpenVMS), 357 
nop pseudo-instruction, 125—126, 338 
NOPIC attribute (OpenVMS), 280 
NOR function, 129, 131 
.noshow directive (OpenVMS), 277
NOSHR attribute (OpenVMS), 280 
Not a number (IEEE), 236 
NOT operation, 129 
NOWRT attribute (OpenVMS), 280 
Null character, terminating a string, 57 
Null frame procedures, 204 
Number of addresses, 27-29 
Number conversion, 133-135, 137-139
Number sign character 

beginning comments (Unix), 51 

in operand field (OpenVMS), 100-101 
Number systems, 12—16 

binary, 12-13 

decimal, 12-13 

floating-point, 33-36 

IEEE floating-point, 33-36 

octal, 12-13 

one’s complement, 15 

hexadecimal, 12—13 

integers, 12-16 

sign and magnitude, 15 

signed integers, 15-16 

two’s complement, 15-16 
Numeric ranges 

floating-point numbers, 33 

integers, 32 
NUMOUT subroutine, 188-192 
NXOR function, 129 
-o flag (Unix), 47
^o unary operator (OpenVMS), 55 
-O0 flag (Unix), 47, 366
-O1 flag (Unix), 336 
-O3 flag (Unix), 336, 366
.obj file type (OpenVMS), 46 
Object file, 43, 46, 54, 65 
OCTA alignment (OpenVMS), 279 
Octaword, 30, 168 
Offset 
in branch instructions, 114-115 
in displacement addressing, 89 
-om flag (Unix), 341, 374 
On-chip cache, 382-384 
One-address instruction set, 27—28 
One’s complement representation, 15 
opc01 through opc07 opcodes, 393 
Opcodes, 25, 53, 81 
Open routines, 266 
OpenVMS 
DCL commands, 7, 10-11 
I/O software, 291—292 
operating system, 6 
Operand specifiers, 25, 51 
Operate class of instructions, 81-82
Operating systems 

OpenVMS, 6 

RSTS/E, 6 

RSX, 6 

RT-11, 6 

Unix, 6 

Windows NT, 6 
Operation codes, 53, 81 
Operator field, 51 
Operators 

assembly language, 52-54 

binary, 60 

precedence of, 61 

unary, 60-61 
Optimization 

and debugging, 374 

dynamic, 373-374 

enabling of, 335-341, 362, 366 

implementation-dependent, 365-370 

in-line functions, 370-373 

inhibition of, 47, 357 

post-compilation, 373-374 

static, 373—374 

tuning, 365-366 
/optimize qualifier (OpenVMS), 341, 357 
OR function (logical sum), 128—130 
or (synonym for bis instruction), 132 
ornot instruction, 130-131 
ots$div_i routine (OpenVMS), 215 
ots$div_l routine (OpenVMS), 215
ots$div_ui routine (OpenVMS), 215 
ots$div_ul routine (OpenVMS), 215 
ots$rem_i routine (OpenVMS), 215 
ots$rem_l routine (OpenVMS), 215
ots$rem_ui routine (OpenVMS), 215 
ots$rem_ul routine (OpenVMS), 215, 218 
Out of order issue, 383 
Output dependency, 343 
Overflow 

floating-point, 242-243 

integer, 86 

with shift instructions, 133 
Overlaid section (OpenVMS), 280 
OVR attribute (OpenVMS), 280 
Packing data not encouraged, 76
.page directive (OpenVMS), 277 
<pal.h> header file (Unix), 395 
PAL_ naming (Unix), 395
PAL_rduniq (Unix), 395
PAL_wruniq (Unix), 395
pal19, pal1B through pal1F opcodes, 393 
PALcode, 30, 392-395 
specific to operating systems, 395 
universal, 393-395 
PALcode class of instructions, 81—82 
PALmode environment, 393 
Parameters (OpenVMS macros) 
actual, 259 
formal, 259 
Parentheses, clarifying precedence (Unix), 61 
Pascal language
argument passing, 202 
compiler output, 358-359 
Passing arguments, 200—203 
for BASIC (OpenVMS), 202 
for C (OpenVMS), 202 
for C (Unix), 202 
for COBOL (OpenVMS), 202 
by descriptor, 201 
for FORTRAN (OpenVMS), 202 
by immediate value, 201 
methods, 201-202 
for Pascal (OpenVMS), 202 
by reference, 201 
pc command (Unix), 400 
PC register, 21, 26-27 
pca56 keyword, 366 
PCI bus, 24, 384 
pd command (Unix), 70 
PDP-11 architecture, 5—6, 29, 80 
PDP-11 emulation, 381 
Pentium as a CISC architecture, 6 
Performance considerations, 321—322 
Period character 
introducing a directive, 54 
symbol for location counter, 59 
in symbols, 56 
perr instruction, 387—388 
perror function (C), 294, 317 
Piano architecture, 2 
Piano implementation, 3 
PIC attribute (OpenVMS), 280 
Pipeline bubbles, 343, 373 
Pipeline hazards, 343-344 
data dependencies, 343 
procedural dependencies, 343 
resource conflicts, 343 
Pipeline stages, 342-343, 383 
Pipeline stalling, 343 
Pipelining, 341-349 
Pixel error, 387 
pklb instruction, 387—388 
pkwb instruction, 387—388 
Plus sign character 
addition operator, 60 
unary positive operator, 61 
Pointers, 93—95 
Pop item from a stack, 106 
Portability of programs, 44 
Position-independent content section (OpenVMS), 280 
Positional coefficients, 12 
Positional parameters (OpenVMS), 269-270 
Post-compilation optimization, 373-374 
PowerPC, 348 
Precedence of operators, 61 
Prediction, branch, 343, 345 
Prefetching, 345 
Prefix, unary, 55 
Pre-indexing, 152 
printf function (C), 294, 295-296 
Privileged architecture library, 392-395 
Procedural dependencies, 343 
Procedure descriptor, 205-206 
PROCEDURE keyword (Pascal), 219 
Procedure value, 205
Procedures 
heavy weight, 204 
light weight, 204 
null frame, 204 
register frame, 204 
stack frame, 204 
Producer—consumer effects, 345-347 
Program counter, 21, 26-27 
Program optimization, 322-325 
Program section, 100, 278-281 
Program size, 324 
Programming environments, 5 
tools, 44—47 
Programs 
APPROXPI, 250-254 
BACKWARD, 169-171 
BADSYMB (OpenVMS), 401 
chrget (function), 164—165, 165—166, 167-169 
chrput (function), 164-165, 165-166, 167-169 
COM_C, 356 
COM_F, 356 
COM_P, 356 
DECNUM, 137-139 
DECNUM2, 183-185 
DOT_3, 102-105
DOT_N, 119-122 
FIB1 function, 327—329 
FIB2 function, 330-331 
FIB3 function, 331-332 
INLINE, 371 
IO_C (OpenVMS), 165-166 
IO_C (Unix), 167-169 
MAXIMUM, 126-128 
MONEY (macro), 281-283 
NUMOUT (subroutine), 188—192 
RADIX, 90-92, 98-101 
RADIX2, 133-135 
RANDBAS (OpenVMS), 222 
RANDC, 223 
RANDCOB, 223-224 
RANDFOR, 223 
RANDFUNC function (OpenVMS), 220-224 
RANDFUNC function (Unix), 224-227 
RANDOM procedure or function, 216-227 
RANDPAS, 222 
RANDPROC procedure (OpenVMS), 217-220 
SCANFILE, 309-312 
SCANTERM, 296-300 
SCANTEXT, 150-152 
SORTINT, 313-317 
SORTSTR, 300-304 
SQUARES, 7-11 
SQUARES2, 48-49
TESTFIB, 333-335 
TESTNUM, 190-192 
TESTPROC (OpenVMS), 219-220 
Prologue, standard 
for OpenVMS, 210 
for Unix, 211 
.prologue directive (Unix), 53 
Psect, 100, 278-281 
.psect directive (OpenVMS), 279 
Pseudo-instructions, 30, 247, 248 
Pseudo-operations, 53
Pseudo-random numbers, 216 
Push item onto a stack, 106 
putchar function (C), 164-165 
puts function (C), 294, 295 

px command (Unix), 70 
/q qualifier (OpenVMS), 228
QIO access to files (OpenVMS), 291-292 
QUAD alignment (OpenVMS), 279 
.quad directive, 57 
Quadword, 11, 30 
Quadword alignment, 30 
Qualifiers for OpenVMS commands, 47 
Question mark, generating a label (OpenVMS), 275 
quit command, 70 
Quotation marks 
enclosing string data, 57 
enclosing string parameters (OpenVMS), 272 
r access mode, 307
Radix control, 55 
RADIX program, 90-92, 98-101 
RADIX2 program, 133-135
RANDBAS program (OpenVMS), 222 
RANDC program, 223 
RANDCOB program (OpenVMS), 223-224 
RANDFOR program, 223 
RANDFUNC function (OpenVMS), 220-224 
RANDFUNC function (Unix), 224-227 
Random numbers, 216 
RANDOM procedure or function, 216-227 
RANDPAS program, 222 
RANDPROC procedure (OpenVMS), 217-220 
rc instruction, 381 
RD attribute (OpenVMS), 280 
rdunique (PALcode), 394-395 
read_unq PALcode (OpenVMS), 395
Readable section (OpenVMS), 280 
Record structures, 101—102 
Recursion, 325-335 
Recursive macros (OpenVMS), 276 
Register direct addressing, 88 
Register frame procedures, 204 
Register indirect addressing, 88, 105-107 
Register indirect deferred addressing, 106-108 
Register move instructions, 389-390 
Register naming 
dbx debugger (Unix), 70 
MACRO-64 assembler (OpenVMS), 49
OpenVMS debugger, 70 
Unix assembler, 49 
Register renaming, 343—344 
Register-level programming, 21 
Register-register architecture, 29 
Registers 
conventional use, 197—199 
floating-point, 21 
function values, 199 
integer, 21 
invisible, 382
for passing arguments, 199, 200 
saved, 198—199 
scratch, 198—199 
REL attribute (OpenVMS), 280 
Relative addressing, 106-107 
Relative deferred addressing, 107 
Relative files (OpenVMS), 292 
Relocatable section (OpenVMS), 280 
Relocatable symbols, 63 
Remainder, in division, 136 
reml pseudo-instruction (Unix), 215-216 
remlu pseudo-instruction (Unix), 215-216 
remq pseudo-instruction (Unix), 215-216 
remqu pseudo-instruction (Unix), 215-216, 224 
Reordering of instructions, 339-341 
Repeat blocks (OpenVMS) 
indefinite, 259-261 
simple, 258-259 
.repeat directive (OpenVMS), 258-259 
Representation of numbers, 12-16 
integers, 12—16 
one’s complement, 15 
sign and magnitude, 15 
signed integers, 15—16 
two’s complement, 15-16 
Resource conflicts, 343 
Resources for reference, 399-404 
ret instruction, 52, 187 
$return macro (OpenVMS), 53, 210-211 
Rewriting of instructions, 336-339, 348-349 
RISC architecture, 6, 354—355 
RISC instructions, power of, 323-324 
RMS files (OpenVMS), 292 
Rounding, 241-242 
$routine macro (OpenVMS), 53, 65, 209-210
rpcc instruction, 381 
rs instruction, 381 
RS/6000, 348 
RSTS/E operating system, 6 
RSX operating system, 6 
RT-11 operating system, 6 
run command 
OpenVMS, 47 
Unix debugger, 70 
.s file type (Unix), 46
/s qualifier (OpenVMS), 228
.s_floating directive, 57
S_floating number, 35-36, 237
s4addl instruction, 85, 108
s4addq instruction, 85, 108-109
s4subl instruction, 85, 108
s4subq instruction, 85, 108
s8addl instruction, 85, 108
s8addq instruction, 85, 108
s8subl instruction, 85, 108
s8subq instruction, 85, 108
save_fp parameter (OpenVMS), 210 
save_ra parameter (OpenVMS), 210
Saved registers, 198-199
saved_regs parameter (OpenVMS), 165-166, 210
.sbttl directive (OpenVMS), 62, 277
Scaled arithmetic instructions 
addition, 85, 108-109
subtraction, 85, 108-109
scanf function (C), 294, 295-296 
SCANFILE program, 309-312 
SCANTERM program, 296-300 
SCANTEXT program, 150-152 
Scratch registers, 198—199 
scratch_regs parameter (OpenVMS), 227-228
Screen mode (OpenVMS), 74 
search command (OpenVMS), 69 
Self-redefining macros (OpenVMS), 276 
Semicolon character, beginning comments (OpenVMS), 51
Sequential files (OpenVMS), 292 
set command (OpenVMS) 
set break, 70 
set trace, 70, 73 
set watch, 70, 73 
.set directive (Unix), 53 
noreorder, 339 
reorder, 339 
sextb instruction, 385-386 
sextl pseudo-instruction, 397 
sextw instruction, 385-386 
Sharable section (OpenVMS), 280 
Shift instructions 
arithmetic, 132-133 
logical, 132 
Shifting 
ext instructions, 145-146, 161-162 
ins instructions, 153, 162
msk instructions, 155, 163 
.show directive (OpenVMS), 277
show window command (OpenVMS), 74 
SHR attribute (OpenVMS), 280 
Sign and magnitude representation 
floating-point numbers, 34—36 
integers, 15 
Sign extend instructions, 385-386 
Signed integers, 15—16 
size keyword (Unix), 370 
Slash character 
in comment delimiter /* . . . */ (Unix), 51
division operator, 60 
sll instruction, 132-133 
Software emulation, 380-381 
Sorting 
integers, 313-317
strings, 300-304 
SORTINT program, 313-317 
SORTSTR program, 300-304 
-source_listing flag (Unix), 366 
Source program, 43 
SP register, 180-181 
SP (OpenVMS), 182 
$sp (Unix), 182
Space character, use in statements, 51 
Special values (IEEE), 235-236 
Specifier field, 51 
speed keyword (Unix), 370 
Spike (Windows NT), 374 
sqrts instruction, 389-390 
sqrtt instruction, 389-390
Square root instruction, 390 
SQUARES program, 7—11 
SQUARES2 program, 48-49
sra instruction, 132 
srl instruction, 132 
Stack, last-in first-out, 106 
Stack, user-defined, 182-183 
Stack addressing, 180-181 
Stack frame procedures, 204 
Stack organization 
for OpenVMS, 212-213 
for Unix, 213-214 
Stack pointer, 106 
Stack-based instruction set, 27 
Stacks, 180-183 
Stalls, data, 344 
Standard C library, 293 
Standard input, 293 
Standard output, 293 
Standard prologue and epilogue 
for OpenVMS, 209-211 
for Unix, 211 
standard_prologue parameter (OpenVMS), 210 
Starting address, 26-27 
Statement, direct assignment, 55 
Statement format, assembly language, 50-51 
Statement types, assembly language, 50 
control, 50, 61—62 
declarative, 50 
imperative, 50 
Static initialization, 281 
stb instruction, 385 
stderr, 293 
stdin, 293 
<stdio.h> header file, 164, 294 
stdout, 293 
step command (OpenVMS), 70 
stepi command (Unix), 70 
Stepwise development, 46—47 
Stepwise execution, 69 
stl instruction, 52, 93 
stl_c instruction, 93 
stopi command (Unix), 70, 73 
Storage allocation, 43, 56-58, 355 
Store instructions
byte and word, 384-386 
floating-point, 236-237 
integer, 93, 96—98 
String concatenation (OpenVMS), 274 
String descriptor (OpenVMS), 202-203 
String parameters (OpenVMS), 272-275 
Strings, 38 
stq instruction, 52, 93 
stq_c instruction, 93
stq_u instruction, 93, 97-98, 156 
sts instruction, 236—237 
stt instruction, 236-237
stw instruction, 385 
subl instruction, 52, 83-85 
subq instruction, 52, 83-85 
subs instruction, 240-241 
subt instruction, 240-241 
Subroutine calls, 195-196
Subroutine instructions, 185-188
Subroutine linkage, 185-188 
Superfamily concept, 5—6 
Superscalar parallelism, 29-30, 343, 383 
swppal (PALcode), 394 
Symbol table, 54, 63, 64—65, 65—67 
Symbols, 56, 64 

absolute, 63 

external, 64, 65 

global, 53, 63, 64, 122 

local, 122 

multiply-defined error, 64, 122 

relocatable, 63 

temporary, 122-123 

undefined, 64, 65 
Symbolic addresses, 51-52 
Symbolic assembler, 54 
Symbolic debugger, 68-74 
sys$gettim routine (OpenVMS), 217-218 
<sys/time.h> header file (Unix), 226 
System calls for I/O 

OpenVMS, 291-292 

Unix, 291 
System libraries in linking process, 47 
/t qualifier (OpenVMS), 228
.t_floating directive, 57 
T_floating number, 34, 237 
Tab character, use in statements, 51 
tcsh shell (Unix), 400 
Telnet client software, 402 
Temporary labels, 122—123 
Terminal I/O, 293-295 
Terminators, line, 308 
Terms, 60 
TESTFIB program, 333-335 
TESTNUM program, 190-192 
TESTPROC program (OpenVMS), 219-220 
.text directive (Unix), 53
Text file I/O, 304—309 
Three-address instruction set, 28 
Tilde character, binary complement operator (Unix), 60 
Time system function 
gettimeofday (Unix), 226 
sys$gettim (OpenVMS), 217-218 
<time.h> header file (Unix), 226 
.title directive (OpenVMS), 53, 61-62, 277 
tracei command (Unix), 70, 73 
Tracepoints, 69, 73 
Trademark information, T 
Transfer address (OpenVMS), 62 
trapb instruction, 381 
True-false sense switching (OpenVMS), 265 
-tune flag (Unix), 366, 368 
tune option (OpenVMS), 368 
Tuning, 366, 368-369 
Two-address instruction set, 28 
Two-pass assembler, 64—65 
Two’s complement representation, 15—16 


Typed language, 58 
UltraSparc architecture, 386
umulh instruction, 86, 136-137 
Unaligned data, 146-150, 156-160 
Unaligned load 

byte, sign-extended, 150 

byte, zero-extended, 150 

longword, sign-extended, 149 

longword, zero-extended, 148 

quadword, 147—148 

word, sign-extended, 149-150 

word, zero-extended, 149 
Unaligned store 

byte, 160 

longword, 159 

quadword, 159 

word, 160 
Unary logical functions, 129 
Unary operators 

arithmetic, 61 

logical, 61 

radix control, 61 
Unary prefix, 55 
Unbiased rounding, 241—242 
Unconditional branch, 118-119 
Underflow, floating-point, 242-243 
Underscore character, 56 
Unformatted line I/O, 295, 307-308 
Universal intermediate representation, 354-355 
Universal PALcode instructions, 393-395 
Unix 

I/O software, 291

on-line documentation, 400—401 

operating system, 6 
unop pseudo-instruction, 338 
unpkbl instruction, 387-388
unpkbw instruction, 387-388 
Unrolling of loops, 324, 364 
unsigned64 data type (Pascal), 219-220 
Upper case, converting to lower, 131 
Upper case usage, 47-48 


V 
/v qualifier, 85, 86 
v suffix, 85, 86

Value of a number, 12 

VAX architecture, 6, 29, 80 

VAX-compatible floating-point instructions, 382 
Vector bit min/max instructions, 387-388 

Version number (OpenVMS), 307 

Vertical bar character, logical OR operator (Unix), 60 
Virtual addresses, 29—30 

VLM64 servers, 23 
w access mode, 307

Watchpoints, 69, 73-74 

wh64 instruction, 381 

when condition (Unix), 70 

Weights of digits, 12 

Windows NT operating system, 6, 384-385 
wmb instruction, 381 

Word, 11 

WORD alignment (OpenVMS), 279 
.word directive, 57 

Writable section (OpenVMS), 280 
write_unq PALcode (OpenVMS), 395 
Writing programs, conventions for, 75—76 
WRT attribute (OpenVMS), 280 
wrunique (PALcode), 394-395 
/X display mode (Unix), 69

-x flag (Unix), 66 

^x unary operator (OpenVMS), 55 
X_floating representation, 382 

xor instruction, 130-131 

/xx display mode (Unix), 69 
zap instruction, 173-174
zapnot instruction, 173—174 
Zero as IEEE number, 236 
Zero bytes instruction, 173—174 
Zero-address instruction set, 27 
normal use for a period of thirty (30) days from the date of your purchase. Your only remedy and the
Company’s only obligation under these limited warranties is, at the Company’s option, return of the 
warranted item for a refund of any amounts paid by you or replacement of the item. Any replace- 
ment of SOFTWARE or media under the warranties shall not extend the original warranty period. 
The limited warranty set forth above shall not apply to any SOFTWARE which the Company deter- 
mines in good faith has been subject to misuse, neglect, improper installation, repair, alteration, or 
damage by you. EXCEPT FOR THE EXPRESSED WARRANTIES SET FORTH ABOVE, THE COM- 
PANY DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMI- 
TATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A 
PARTICULAR PURPOSE. EXCEPT FOR THE EXPRESS WARRANTY SET FORTH ABOVE, THE 
COMPANY DOES NOT WARRANT, GUARANTEE, OR MAKE ANY REPRESENTATION 
REGARDING THE USE OR THE RESULTS OF THE USE OF THE SOFTWARE IN TERMS OF ITS 
CORRECTNESS, ACCURACY, RELIABILITY, CURRENTNESS, OR OTHERWISE. 


IN NO EVENT, SHALL THE COMPANY OR ITS EMPLOYEES, AGENTS, SUPPLIERS, OR 
CONTRACTORS BE LIABLE FOR ANY INCIDENTAL, INDIRECT, SPECIAL, OR CONSEQUEN- 
TIAL DAMAGES ARISING OUT OF OR IN CONNECTION WITH THE LICENSE GRANTED 
UNDER THIS AGREEMENT, OR FOR LOSS OF USE, LOSS OF DATA, LOSS OF INCOME OR 
PROFIT, OR OTHER LOSSES, SUSTAINED AS A RESULT OF INJURY TO ANY PERSON, OR LOSS 
OF OR DAMAGE TO PROPERTY, OR CLAIMS OF THIRD PARTIES, EVEN IF THE COMPANY OR 
AN AUTHORIZED REPRESENTATIVE OF THE COMPANY HAS BEEN ADVISED OF THE POSSI- 
BILITY OF SUCH DAMAGES. IN NO EVENT SHALL LIABILITY OF THE COMPANY FOR DAM- 
AGES WITH RESPECT TO THE SOFTWARE EXCEED THE AMOUNTS ACTUALLY PAID BY 
YOU, IF ANY, FOR THE SOFTWARE. 


SOME JURISDICTIONS DO NOT ALLOW THE LIMITATION OF IMPLIED WARRANTIES
OR LIABILITY FOR INCIDENTAL, INDIRECT, SPECIAL, OR CONSEQUENTIAL DAMAGES, SO
THE ABOVE LIMITATIONS MAY NOT ALWAYS APPLY. THE WARRANTIES IN THIS AGREEMENT
GIVE YOU SPECIFIC LEGAL RIGHTS AND YOU MAY ALSO HAVE OTHER RIGHTS
WHICH VARY IN ACCORDANCE WITH LOCAL LAW.


ACKNOWLEDGMENT 


YOU ACKNOWLEDGE THAT YOU HAVE READ THIS AGREEMENT, UNDERSTAND IT,
AND AGREE TO BE BOUND BY ITS TERMS AND CONDITIONS. YOU ALSO AGREE THAT THIS
AGREEMENT IS THE COMPLETE AND EXCLUSIVE STATEMENT OF THE AGREEMENT
BETWEEN YOU AND THE COMPANY AND SUPERSEDES ALL PROPOSALS OR PRIOR AGREEMENTS,
ORAL, OR WRITTEN, AND ANY OTHER COMMUNICATIONS BETWEEN YOU AND
THE COMPANY OR ANY REPRESENTATIVE OF THE COMPANY RELATING TO THE SUBJECT
MATTER OF THIS AGREEMENT.


Should you have any questions concerning this Agreement or if you wish to contact the Company
for any reason, please contact in writing at the address below.


Robin Short 

Prentice Hall PTR 

One Lake Street 

Upper Saddle River, New Jersey 07458 





About the CD-ROM 


The CD-ROM that accompanies this book contains the source code for all of the illus- 
trative programs, starting with the SQUARES example in Chapter 1 and continuing all 
the way to the explorations of compiler output in Chapter 13. 


The CD-ROM has been manufactured to have partitions that present the files 
appropriately to either a Macintosh or a Windows desktop client system. On a Macin- 
tosh, the files can be viewed with BBEDIT (see Suggested Resources), SimpleText, or a 
word processor. On a Windows system, the files can be viewed with WordPad or a word 
processor, as well as with the TYPE command at the DOS prompt. 


Files that are independent of operating system are at the top directory level. Those 
files include programs in various high level languages as well as miscellaneous test data 
files. Assembly language files specific to MACRO-64 are in an OpenVMS subdirectory 
(folder). Similarly, files specific to the UNIX assembler for the Alpha are in a UNIX 
subdirectory (folder). 


Technical Support 


Prentice Hall does not offer technical support for this software. However, if there is a 
problem with the media, you may obtain a replacement copy by emailing us with your 
problem at: 


discexchange@phptr.com 
ALPHA RISC ARCHITECTURE


JAMES S. EVANS 
RICHARD H. ECKHOUSE 


With Alpha RISC Architecture for Programmers, you can master the fundamentals of 
computer architecture and assembly language programming in the context of one of 
the world’s most advanced high-performance processors: the 64-bit Alpha. The book 
introduces assemblers, debuggers, instruction formats, addressing, branch instructions, 
logical operations, and many other key fundamentals of processor architecture. It 
delivers real-world guidance for solving practical programming problems with
extensive runnable sample code. Coverage includes: 


• Working with bytes in a load/store architecture
• Subroutines, procedures, and floating-point operations
• Conditional assembly and macros
• Text I/O, including UNIX and OpenVMS implementations
• Run-time environment support for both high-level and low-level languages
• Writing subprograms and linking higher-level modules with assembly language modules


Learn how processor architecture impacts performance, including the roles of instruction 
size, addressing mode, instruction power, program size, in-line functions, recursion, 
pipelining, compilers, and post-compilation optimization. Master essential debugging 
techniques, and review the latest features of the Alpha architecture, including the
Motion Video Extension and Privileged Architecture Library. The book includes extensive
references to Alpha resources for Windows® NT, UNIX®, and OpenVMS. It will be an 
invaluable resource for hardware engineers, programmers, and students alike. 


ABOUT THE AUTHORS


JAMES S. EVANS is a Professor at Lawrence University in Appleton, WI, where he also
serves as Director of Information Technology Planning. He has taught assembly language, 
computer architecture and computer hardware organization, and co-founded WiscNet, which 
provides Internet connectivity to nearly all educational institutions throughout Wisconsin. 


RICHARD H. ECKHOUSE is a Professor at the University of Massachusetts, Boston, 
and Vice President and co-founder of MOCO, inc., a biomedical research firm. He has held
management positions at DEC, and is currently an elected member of the Board of Governors 
of the IEEE Computer Society. 